Required citation: FAO. 2025. A course on applied data analytics for soil analysis with infrared spectroscopy. Rome.
Licensing Information
The designations employed and the presentation of material in this information product do not imply the expression of any opinion whatsoever on the part of the Food and Agriculture Organization of the United Nations (FAO) concerning the legal or development status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The mention of specific companies or products of manufacturers, whether or not these have been patented, does not imply that these have been endorsed or recommended by FAO in preference to others of a similar nature that are not mentioned.
The views expressed in this information product are those of the author(s) and do not necessarily reflect the views or policies of FAO.
ISBN xxx
© FAO, 2025
Some rights reserved. This work is made available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 IGO licence (CC BY-NC-SA 3.0 IGO; https://creativecommons.org/licenses/by-nc-sa/3.0/igo/legalcode).
Under the terms of this licence, this work may be copied, redistributed and adapted for non-commercial purposes, provided that the work is appropriately cited. In any use of this work, there should be no suggestion that FAO endorses any specific organization, products, or services. The use of the FAO logo is not permitted. If the work is adapted, then it must be licensed under the same or equivalent Creative Commons licence. If a translation of this work is created, it must include the following disclaimer along with the required citation:
“This translation was not created by the Food and Agriculture Organization of the United Nations (FAO). FAO is not responsible for the content or accuracy of this translation. The original [Language] edition shall be the authoritative edition.”
Disputes arising under the licence that cannot be settled amicably will be resolved by mediation and arbitration as described in Article 8 of the licence except as otherwise provided herein. The applicable mediation rules will be the mediation rules of the World Intellectual Property Organization (WIPO Mediation Rules), and any arbitration will be conducted in accordance with the Arbitration Rules of the United Nations Commission on International Trade Law (UNCITRAL).
Third-party materials. Users wishing to reuse material from this work that is attributed to a third party, such as tables, figures, or images, are responsible for determining whether permission is needed for that reuse and for obtaining permission from the copyright holder. The risk of claims resulting from infringement of any third-party-owned component in the work rests solely with the user.
Sales, rights and licensing. FAO information products are available on the FAO website (www.fao.org/publications) and can be purchased through publications-sales@fao.org. Requests for commercial use should be submitted via: www.fao.org/contact-us/licence-request. Queries regarding rights and licensing should be submitted to: copyright@fao.org.
Editing and publication
Matteo Sala (FAO-GSP)
Andy Murray (FAO-GSP)
This document provides a comprehensive guide to the process of soil spectroscopic modelling, from data acquisition to interpretation, with a focus on using R for analytical modelling. It covers the entire workflow involved in generating reliable predictions of soil properties from spectral data, including data preprocessing, exploratory analysis, calibration, model training, and validation. Practical examples in R are provided throughout, illustrating key steps such as spectral transformation, principal component (PC) analysis, and multivariate regression techniques for predictive modelling. These steps help convert raw spectral data into actionable insights, allowing users to estimate soil attributes like organic carbon (C), clay content, and mineral composition. The document introduces well-known mathematical transfer functions, which correlate spectral wavelengths to soil properties, and shows how these models can be calibrated using reference data from conventional laboratory analysis. Additionally, it explores the challenges of interpreting complex infrared spectra, especially the overlapping absorption bands in the vis-NIR region, and provides strategies to overcome these difficulties. It emphasizes the use of chemometric tools, such as partial least squares (PLS) regression, to extract meaningful relationships between spectral features and soil attributes. This document is the second in a series of three training materials designed to guide users through basic, intermediate, and advanced topics in soil vis-NIR and MIR spectroscopy. The content is aimed at equipping researchers, practitioners and soil laboratory technicians with the knowledge and tools needed to apply spectroscopic techniques for soil analysis, with a focus on practical implementation using R.
What is soil spectroscopy?
Soil spectroscopy is concerned with the qualitative and quantitative estimation of soil compounds with spectroscopic data. The visible and near-infrared (vis-NIR) and mid-infrared (MIR, also widely known simply as infrared or IR) regions of the electromagnetic spectrum are gaining interest in soil research. This is because many soil physical, chemical, and biological properties have distinct spectral signatures in these regions. These signatures are characterized by specific peaks, where the intensity is related to a soil compound at a specific band or a complex combination of spectral bands. Chemometric techniques are employed to model the relationships between these complex signatures and soil properties, and yield predictive models for the properties of interest. The modelling results and estimates of soil properties can be used in a series of applications.
Soil spectroscopy has several advantages over conventional laboratory methods for the estimation of soil attributes. These benefits have been documented extensively in recent years. Spectral measurements of soil are relatively rapid and inexpensive, and enable the estimation of several important soil attributes from a single spectral reading. For these reasons, spectroscopic analysis is usually more cost-effective than conventional laboratory methods. Further, preparing soil samples for spectroscopy requires minimal effort (i.e., usually 2 mm sieving and grinding for a homogeneous particle size) and avoids the extensive use of chemical reagents needed for conventional soil analysis, thus mitigating the overall environmental impact of a soil survey. Methods for soil spectral analysis are now being applied in the field through the use of proximal soil sensors, enabling in-situ soil scanning.
Figure @ref(fig:spectrumwavelength) shows the spectrum for all possible frequencies of electromagnetic radiation and the corresponding wavelength values. The electromagnetic spectrum ranges from high-frequency \(\gamma\) rays (above \(10^{21}\) Hz, or wavelengths shorter than \(10^{-13}\) m) to radiowaves, which have long wavelengths (\(10^8\) m) and low frequencies (10 Hz). Commonly, the ultraviolet, visible, and near-infrared ranges are expressed in wavelength, with units of micrometers (\(\mathrm{\mu}\)m) or nanometers (nm), whereas the mid- and far-infrared ranges are usually expressed in energy units, namely wavenumbers in reciprocal centimeters (cm\(^{-1}\)). In the infrared (\(0.75-1,000\) \(\mathrm{\mu}\)m; \(750-10^{6}\) nm; \(12,500-10\) cm\(^{-1}\)) range, the radiation interacts with molecules contained in the soil material, causing the bonds to vibrate in different modes (such as stretching, bending, or rotating), which provide information on the soil molecular composition. For example, methyl and methylene groups in soil have a distinctive absorption area at \(2,920\) cm\(^{-1}\), which corresponds to the infrared radiation absorbed by the aliphatic asymmetric C-H stretch. When radiation is directed at a soil sample, the absorptions at different wavelengths produce a characteristic spectral shape that relates to the compounds in the sample and can be used for analytical purposes. A spectrum of a soil sample is therefore generated by recording the amount of radiation reflected by the sample material for all wavelengths in the region of study (e.g., the NIR). At those wavelengths where the signal is absorbed, the reflected radiation (\(R\)) decreases. It is usual to transform the reflectance to apparent absorbance (\(A\)) through the relation \(A = \log_{10}(1/R)\). For more information, the reader is redirected to the Soil Spectroscopy Training Manual 1 (Ge, Wadoux, and Peng 2022).
Electromagnetic spectrum, from Wadoux et al. (2021) with permission.
The near-infrared (NIR) range is dominated by overtones and combinations (Figure 1.2) of fundamental vibrations that occur in the mid-infrared (MIR) range. These fundamental vibrations are conventionally defined in the MIR as symmetric stretching (\(\nu_1\)), symmetric bending (\(\nu_2\)), asymmetric stretching (\(\nu_3\)), and asymmetric bending (\(\nu_4\)). In the NIR, these vibrations are termed \(\nu\) for stretching and \(\delta\) for bending. Occasionally, features in the vis-NIR range arise from electronic transitions (typically associated with cations of metals such as Fe and Ti). The fundamental vibrations in the MIR give rise to progressively weaker overtones and combinations in the NIR region. As a consequence, NIR spectra have fewer, broader, and more overlapping absorption bands than MIR spectra, and are thus more difficult to interpret.
Typically, both the vis-NIR and MIR ranges contain information on various organic and inorganic soil compounds. Soil carbon content (total C, inorganic C, organic C) has absorption in the infrared resulting from overtones of aliphatic C-H stretching, amide C=O stretching, C=H, and aromatic C=C stretching, among others. However, identification of several absorption features of organic functional groups is challenging because they overlap with absorbance of mineral components, particularly in the range 1400-800 cm\(^{-1}\). Absorption of clay minerals occurs due to hydroxyl O-H stretching and bending, while silicate Si-O stretching overlaps with bands of organic functional groups. Overtones in the NIR can result from fundamental vibrations of sulfate (SO\(_4\)), carbonates (CO\(_3\)), and metal-OH bonds. Overall, infrared spectra are strongly influenced by water, with absorption in the NIR due to a vibration of H–O–H and in the MIR where absorbance of other soil compounds occurs (e.g., at 3000 cm\(^{-1}\)). More details on absorption features are provided in the Discussion.
Soil mid-infrared spectra (top) with the X-H stretch, triple bond (TB), double-bond (DB) and fingerprint regions and visible and near-infrared spectra (bottom) showing where the first, second, third overtones (OT) as well as combinations occur.
The practical implication of the difference between vis-NIR and MIR discussed in the Introduction is that scanning soil in the vis-NIR range requires less sample preparation. As a result, vis-NIR spectrometry has been effectively deployed in-situ in the field with miniaturized spectrometers. Although these field instruments can be influenced by environmental conditions, such as soil moisture, they are not necessarily worse than laboratory spectrometers. Field spectrometers have become routine and provide valuable assistance in soil monitoring surveys, especially when a high density of soil observations is required. However, vis-NIR spectrometers are less sensitive to quartz (their optics often contain quartz lenses), and MIR spectrometers remain, to date, almost exclusively restricted to laboratory use.
Soil attributes often have complex absorption patterns in the infrared range of the electromagnetic spectrum. The main frequencies of spectral absorptions in the infrared are often non-specific (i.e., they are not linearly related to a single soil property) and absorption bands overlap between properties. This is particularly significant in the vis-NIR range of the spectrum (Soriano-Disla et al. 2014), which is more difficult to interpret than the MIR range. In addition, certain constituents such as quartz are spectrally inactive in the vis-NIR range, and the whole infrared range is affected by scatter effects due to the surface properties of the soil sample. Several soil attributes (oxides, carbonates, organic matter, clay minerals) can be detected directly from the spectra, owing to the presence of specific absorption features, while some other frequently reported properties (e.g., pH) are detected only because of their association with a combination of spectrally active components. To extract these complex absorption patterns and obtain quantitative estimates of a soil property, soil scientists have used mathematical transfer functions to correlate spectral wavelengths to soil attributes. The transfer function is calibrated using the spectral wavelengths as independent variables and the laboratory-measured values of the soil attribute as the dependent variable. Once calibrated on the spectra, the soil property can be predicted using the spectral information only.
The pipeline to obtain quantitative information from soil infrared spectroscopic data.
The steps required for building a reliable spectroscopic model are shown in Figure @ref(fig:pipeline). Once the soil material is scanned, the spectral and physical-chemical data are loaded into the analytics software in an appropriate and convenient format. Spectra need to be pre-processed using mathematical transformations that aim to improve their quality and remove undesirable noise. The pre-processed spectra can then be visualized, and techniques such as principal component analysis can be used for exploratory analysis. A subset of the soil samples is sent to the laboratory for conventional analysis, and the resulting reference values are used for building the model. The modelling process includes data partitioning, model training, and validation. Finally, the model results are interpreted. All these steps are detailed in the following sections of this manual, and the code and data are available on the FAO Github repository.
This manual includes R code to demonstrate various spectral data analysis techniques and methodologies. However, it does not provide an introduction to the R programming language itself. For those new to R, we recommend consulting the open book “Introduction to R” (Douglas et al. 2014), which offers a comprehensive guide to getting started with R. Additionally, for users interested in utilizing the tidyverse collection of packages (Wickham et al. 2019), which are widely used for data manipulation and visualization, we suggest referring to the book “R for Data Science” (Wickham, Çetinkaya-Rundel, and Grolemund 2023). These resources will provide the basic knowledge necessary to effectively understand and apply the R code presented in this manual.
Preparing and loading the data into the analytics software is an important first step in our analysis. The way we prepare our data partially determines how smooth the subsequent processes will be. Here is an example of how we suggest doing this in R, using the dat2.csv file, which already contains soil physico-chemical data of several samples along with their respective spectra:
# Since dat2.csv is comma-separated, we use the option sep = ","
# if your data is separated by tabs, for example, you can use sep = "\t"
# NOTE: the function read.table will look into the current working directory.
# Make sure to have the working directory set to the correct path using the
# function setwd('your/path/to/file/')
dat <- read.table('dat2.csv', sep = ",", header = TRUE)
# let's rename the first column to "sample" as it indicates the sample index
colnames(dat)[1] <- "sample"
# show how the table is currently organized: first 10 rows and first 7 columns
head(dat, c(10, 7))
The table contains the sample name in the first column, followed by the measured soil attributes Organic_Carbon, Clay, Silt and Sand. The spectra follow, starting with the sixth column called X350, which corresponds to the wavelength at 350 nm.
Once we have loaded the data, we can organize it in a way that allows us to quickly access the soil attributes and spectra separately. We recommend creating a data.frame where each non-spectral variable can be accessed directly by its name and all the spectral variables can be called at once using one single name. Let’s see how this is done in R:
# This line searches for column names in the 'dat' dataset that match a specific pattern.
# The pattern "X[0-9]{2,10}$" is a regular expression where:
# - "X": matches the character 'X' at the start of each spectral column name.
# - "[0-9]{2,10}": matches between 2 and 10 digits (0-9) following the 'X'.
# - "$": ensures that the match occurs at the end of the column name.
# The 'grep' function returns the indices of the column names that match this
# pattern.
# If in your data the spectral variables start with another prefix, you should
# replace "X" with that prefix.
spectral_columns <- grep("X[0-9]{2,10}$", colnames(dat))
Now that we know which columns correspond to the spectral variables, we can isolate the spectra into a matrix:
my_spectra <- as.matrix(dat[, spectral_columns])
# rename the columns to remove the X prefix
colnames(my_spectra) <- gsub("X", "", colnames(my_spectra))
We can then readjust our original data as follows:
# Temporarily remove the spectral columns from the 'dat' dataset
dat <- dat[, -spectral_columns]
# Reassign the spectra to 'dat' as a single variable named 'spc_raw'
dat$spc_raw <- my_spectra
# finally remove the object "my_spectra"
rm(my_spectra)
With the above code, we have organized our data into a clean structure where the non-spectral variables are directly accessible by their names in the dat data frame, and the spectral data is consolidated into a single variable (spc_raw). This structure facilitates further analysis by allowing easy access to both types of data while keeping the dataset well-organized.
An additional recommendation related to code style is that all R object names and variables be assigned in lowercase letters. This practice helps prevent typos related to case sensitivity, which can sometimes lead to hard-to-detect code issues or bugs.
For the sake of demonstration, we also load MIR spectra from the same dataset. Note here that the MIR spectra are loaded for visualization purposes and not used hereafter. We use the same procedure as for the vis-NIR data previously loaded.
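For reference, the following sketch shows this procedure applied to the MIR file; it assumes that dat2Mir.csv (used later for plotting) has the same layout and sample order as dat2.csv, with spectral columns prefixed by X followed by the wavenumber:
# Load the MIR spectra (assumption: dat2Mir.csv has the same layout and
# sample order as dat2.csv)
dat_mir <- read.table('dat2Mir.csv', sep = ",", header = TRUE)
# identify the spectral columns: an "X" prefix followed by the wavenumber
mir_spectral_columns <- grep("X[0-9]{2,10}$", colnames(dat_mir))
my_mir_spectra <- as.matrix(dat_mir[, mir_spectral_columns])
# remove the X prefix so that the column names are the wavenumbers
colnames(my_mir_spectra) <- gsub("X", "", colnames(my_mir_spectra))
# attach the MIR spectra to 'dat' as a single variable named 'spc_raw_ir'
dat$spc_raw_ir <- my_mir_spectra
rm(dat_mir, my_mir_spectra)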
We can visualize the spectra in R using various packages, the simplest of which are the default base packages with the function matplot. An alternative is to use the ggplot2 package (Wickham 2016), which provides more functionalities to visualize data. We want to create a ggplot in R where each row of the dataset is a spectrum represented as a line, and the lines are coloured based on the values of the Organic_Carbon column. One can follow the steps below. This code uses the ggplot2 package in R. Note that the first step is to read the dataset and prepare it for plotting. For illustration purposes we plot the vis-NIR and the MIR spectra from the dat2.csv and dat2Mir.csv datasets loaded and pre-processed in the previous section.
# Install patchwork and ggplot2 in case these libraries are not yet installed
# install.packages("ggplot2")
# install.packages("patchwork")
library(ggplot2)
library(patchwork)
# get the wavelengths of the spectra from the column names and convert them
# into numeric values
my_wavelengths <- as.numeric(colnames(dat$spc_raw))
# get the wavenumbers of the MIR spectra from the column names and convert them
# into numeric values
my_wavenumbers <- as.numeric(colnames(dat$spc_raw_ir))
# For an advanced-looking plot, we will use ggplot
# First we create the long format data frame:
# - 'sample' is repeated for each wavelength to maintain the sample-spectra
# relationship.
# - 'oc' (Organic Carbon) is repeated for each wavelength to be used as a color
# mapping in the plot.
# - 'wavelength' is repeated for each sample, representing the measurement
# points.
# - 'absorbance' is the spectral data values converted to a vector and
# transposed to align with 'sample' and 'wavelength'
# NOTE: here we convert the NIR spectra from reflectance to (apparent) absorbance
# as this is the conventional way the spectral information is presented
vnir_long <- data.frame(
sample = rep(1:nrow(dat), each = ncol(dat$spc_raw)),
oc = rep(dat$Organic_Carbon, each = ncol(dat$spc_raw)),
wavelength = rep(my_wavelengths, nrow(dat)),
absorbance = as.vector(t(log(1 / dat$spc_raw, 10)))
)
ir_long <- data.frame(
sample = rep(1:nrow(dat), each = ncol(dat$spc_raw_ir)),
oc = rep(dat$Organic_Carbon, each = ncol(dat$spc_raw_ir)),
wavenumber = rep(my_wavenumbers, nrow(dat)),
absorbance = as.vector(t(dat$spc_raw_ir))
)
# We can now use ggplot to create our spectral plot
# Create the first plot
plot_vnir <- ggplot(
vnir_long,
aes(x = wavelength, y = absorbance, group = sample, color = oc)) +
geom_line(alpha = 0.5) + # Set alpha to 0.5 for transparency
scale_color_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(
x = "Wavelength (nm)",
y = "Calculated absorbace",
color = "Organic Carbon (%)"
) +
theme(legend.position = "top")
# Create the second plot
plot_ir <- ggplot(
ir_long,
aes(x = wavenumber, y = absorbance, group = sample, color = oc)) +
geom_line(alpha = 0.5) + # Set alpha to 0.5 for transparency
scale_color_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(
x = expression(Wavenumber~(cm^{-1})),
y = "Calculated absorbace",
color = "Organic Carbon (%)"
) +
scale_x_reverse() +
theme(legend.position = "none")
# Combine the two plots vertically
combined_plot <- plot_vnir / plot_ir
combined_plot
Spectra coloured by the content of organic carbon of the samples.
The above code performs the following steps:
- Reads the dataset dat2.csv and prepares it for plotting.
- Extracts the Organic_Carbon column for later use in coloring the lines and then removes this column from the dataset.
- Loads ggplot2.
- Creates a ggplot where each row (i.e. sample in this case) of the dataset is represented as a line, with the lines colored based on the organic carbon values. The colour gradient goes from blue (low values) to red (high values).
Applying mathematical pre-processing functions on the raw spectra is the first step, and possibly the most important, in spectral data analysis with chemometrics. When scanning the soil, the soil sample volume and preparation, the measurement method, and measuring parameters such as the choice of scanning time and scanning resolution add noise to the recorded spectral data. Further, when scanning a material with non-uniform particle sizes such as soils, the reflectance spectrum is often accompanied by scattering noise. In fact, the recorded spectrum is not only the result of surface reflection (Fresnel reflection); soil samples absorb radiation that penetrates beneath the surface. As a result, the recorded spectrum depends on the relative amount of radiation absorption and surface reflectance, and lost radiation that is not reflected back to the sensor. In short, the radiation that hits the soil sample is partly reflected, partly absorbed, and partly transmitted. The implication of this phenomenon is that the recorded reflectance spectra do not directly relate to the absorption. A set of simple transformations can be applied to obtain the absorbance spectra. The most basic is the transformation of the reflectance spectral matrix \(R\) by \(\log_{10}(1/R)\) or \(-\log_{10}(R)\). These transformations are embedded in all spectral data manipulation packages and software implementations.
For more detailed information on common preprocessing techniques, readers can refer to the documentation of the prospectr package (Stevens and Ramirez-Lopez 2014), which provides implementations for various smoothing, scattering correction, and derivative methods for spectral processing.
Before applying standard preprocessing techniques, it is essential to check if the spectrometer used to record the spectra produces splice artifacts. These discontinuities often occur in spectrometers equipped with multiple spectral detectors, such as those with separate detectors for the visible and NIR regions. For instance, some sensors may produce splices at 1000 nm and 1830 nm, leading to noticeable “steps” in the data where measurements from different detectors are combined. These steps can become problematic for modelling if not addressed, particularly when derivatives are employed.
Splice correction is a critical process that addresses these artifacts by linearly interpolating the values at the edges of splice regions, ensuring a smooth transition between the outputs of different detectors. This correction minimizes the impact of detector transitions on subsequent analysis, resulting in continuous and accurate spectral data. The spliceCorrection function in the prospectr package (Stevens and Ramirez-Lopez 2014) allows users to perform this correction by specifying the splice positions and the number of interpolation bands needed for the desired adjustment.
Let’s take another look at Figure @ref(fig:specplot). If we closely examine the region around the 1000 nm wavelength, we can observe a nearly imperceptible artifact that appears as a subtle vertical line. There is another, less visible artifact around the 1830 nm wavelength. We anticipated these issues because we know that our data was collected using a device with three detectors: the first records spectra up to 1000 nm, the second from 1001 nm to 1830 nm, and the third from 1831 nm onwards.
library(prospectr)
dat$spc <- spliceCorrection(dat$spc_raw, my_wavelengths, splice = c(1000, 1830))
spc_long <- data.frame(
sample = rep(1:nrow(dat), each = ncol(dat$spc)),
oc = rep(dat$Organic_Carbon, each = ncol(dat$spc)),
wavelength = rep(my_wavelengths, nrow(dat)),
reflectance = as.vector(t(dat$spc))
)
# We can now use ggplot to create our spectral plot
ggplot(
spc_long,
aes(x = wavelength, y = reflectance, group = sample, color = oc)) +
geom_line(alpha = 0.5) + # Set alpha to 0.5 for transparency
scale_color_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(
x = "Wavelength (nm)",
y = "Reflectance",
color = "Organic Carbon (%)"
) +
theme(legend.position = "top")
Splice-artifact-free spectra colored by the organic carbon content of the samples.
Figure @ref(fig:specplotsplice) shows that the artifact initially observed in Figure @ref(fig:specplot) has been removed.
Smoothing of the spectra is typically used to reduce noise. Examples of methods to remove noise in the spectra include moving window average and Savitzky-Golay filtering. These techniques help in smoothing out fluctuations and improving the clarity of the spectral data. It should be noted that while smoothing is essential, it may need to be carefully balanced to avoid losing important spectral features. Additionally, derivatives can be applied to the spectra to accentuate absorbance features and remove both additive and multiplicative effects. However, derivatives tend to amplify noise, so smoothing is often recommended before applying derivatives.
In certain cases, specific wavelength portions of the spectra may exhibit a high signal-to-noise ratio, while the rest of the spectra show a lower signal-to-noise ratio. In these scenarios, it may be beneficial to simply remove those portions from our analysis. For example, the edges of the spectra (typically between 400 nm to 500 nm and from 2450 nm to 2500 nm in the NIR range), tend to be noisy. Trimming these regions can help improve subsequent spectra analyses.
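As an illustration, such trimming can be done by simple column subsetting. The sketch below uses assumed limits of 500 nm and 2450 nm and does not overwrite the spectra used in the rest of this tutorial:
# keep only the bands between 500 nm and 2450 nm (illustrative limits)
keep <- my_wavelengths >= 500 & my_wavelengths <= 2450
spc_trimmed <- dat$spc[, keep]
# check the reduced number of bands
dim(spc_trimmed)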
Moving window average
The moving window average, also known as the rolling or running average, is a simple smoothing technique used in signal processing. It involves taking the average of a subset of data points within a specified window that moves across the data set. For each position of the window, the average of the data points within it is calculated, and this process is repeated as the window moves through the entire dataset. This technique helps in reducing noise and making the underlying trends in the data more apparent.
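To make the idea concrete, here is a minimal base-R illustration of a moving window average (the prospectr implementation used later in this section is the recommended tool):
# a centred moving average with window size 3, using stats::filter;
# the edge points have no complete window and are returned as NA
x <- c(1, 2, 4, 7, 11, 16, 22)
stats::filter(x, rep(1 / 3, 3), sides = 2)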
Savitzky-Golay Filtering
The Savitzky-Golay filter is a signal filtering technique that applies polynomial smoothing to data. Unlike the moving window average, which replaces each data point with the average of its neighboring points, the Savitzky-Golay filter fits successive subsets of adjacent data points with a low-degree polynomial using linear least squares. When the polynomial degree is zero, the Savitzky-Golay filter behaves like the moving window average. The primary advantage of this method is its ability to preserve important features of the data distribution, such as relative maxima, minima, and peak width, which are often “flattened” by other averaging techniques. The equation for the Savitzky-Golay filter is:
\[ \hat{y}_{i} = \frac{1}{n}\sum_{j=-h}^{h}c_{j} \ y_{i+j}, \]
where \(n\) is the number of points in the window, \(\hat{y}_{i}\) is the smoothed value of the spectrum at wavelength \(i\), \(y_{i+j}\) are the spectrum values within the window centered at \(i\), \(h\) is half the width of the window (i.e., the window size is \(2h + 1\)), and \(c_{j}\) is the coefficient of the polynomial.
Note that Savitzky-Golay filtering is also used for computing signal derivatives. This is explained in the upcoming section on derivatives, where the filter’s ability to smooth data while preserving derivative information is particularly valuable.
Similarities between moving window average and Savitzky-Golay
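As noted above, a Savitzky-Golay filter with a polynomial of degree zero is equivalent to the moving window average. The short base-R sketch below (illustrative only, not part of prospectr) computes the Savitzky-Golay coefficients for a degree-zero polynomial and shows that they all equal \(1/n\), i.e., the weights of a moving window average:
# The Savitzky-Golay coefficients are the first row of
# solve(t(V) %*% V) %*% t(V), where V is the Vandermonde matrix
# of the window offsets
h <- 5 # half-window; the window size is 2 * h + 1 = 11
offsets <- -h:h
p <- 0 # polynomial degree zero
V <- outer(offsets, 0:p, `^`) # a single column of ones when p = 0
c_j <- (solve(t(V) %*% V) %*% t(V))[1, ]
c_j # all coefficients equal 1/11: identical to a moving window average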
Implementation in R
The prospectr package in R is designed for
the processing and analysis of spectral data, including soil infrared
spectroscopic data. It includes functions for both moving window average
and Savitzky-Golay filtering.
To illustrate the effect of smoothing, let’s use one single spectrum. Let’s take the first spectrum in our dataset and add some (normally distributed) random noise:
# Set a seed for reproducibility of random noise
set.seed(801124)
# Add random noise to the first sample's spectrum
my_noisy_spc <- dat$spc[1, ] + rnorm(ncol(dat$spc), sd = 0.0025)
Now let’s apply the smoothing pre-treatments to our noisy spectrum:
library(prospectr)
# Replace 'my_noisy_spc' with your actual data frame or matrix and 'w' with the desired
# window size
mwa_result <- movav(my_noisy_spc, w = 11)
Now Savitzky-Golay filtering with prospectr:
# Replace 'my_noisy_spc' with your actual data frame or matrix, 'm' with the window size,
# and 'p' with the polynomial order
sg_result <- savitzkyGolay(
my_noisy_spc, # the noisy spectrum
m = 0, # this m is for the derivative order, 0 is for no derivative
p = 3, # the polynomial order
w = 11 # the window size
)
In the above examples, my_noisy_spc represents your input spectra, i.e., the spectra that you want to smooth, w represents the window size used for smoothing, and p represents the polynomial order used in the Savitzky-Golay filter. Let us plot the different spectra to observe the effects of smoothing:
# Create a long-format data frame for the MVA spectrum
long_mwa_spc <- data.frame(
wavelength = as.numeric(names(mwa_result)), # Extract the wavelengths
spectral_value = mwa_result, # Store the spectral values from the MVA
method = "mva" # Label the data
)
# Create a long-format data frame for the Savitzky-Golay (SG) spectrum
long_sg_spc <- data.frame(
wavelength = as.numeric(names(sg_result)), # Extract the wavelengths
spectral_value = sg_result, # Store the spectral values from the SG
method = "sg" # Label the data
)
# Combine the long-format MVA and SG data frames into one
denoised_long <- rbind(long_mwa_spc, long_sg_spc)
# Create a data frame for the original noisy spectrum
original_noisy <- data.frame(
wavelength = my_wavelengths, # Use the original wavelengths
spectral_value = my_noisy_spc # Use the original noisy spectrum values
)
my_labels <- c(
"mva" = "Moving average smoothing",
"sg" = "Savitzky-Golay filtering"
)
# Plot the spectra using ggplot2
ggplot() +
# Plot the original noisy spectrum as a black line in the background
geom_line(
data = original_noisy,
aes(x = wavelength, y = spectral_value),
color = "black"
) +
# Plot the MVA and SG processed spectra on top of the original noisy spectrum
geom_line(
data = denoised_long,
aes(x = wavelength, y = spectral_value, color = method, group = method)
) +
# Add titles and labels for the axes
labs(
x = "Wavelength (nm)",
y = "Reflectance"
) +
# Manually set the colors for the MVA and SG lines
scale_color_manual(
values = c("mva" = "#00AFBB", "sg" = "#E7B800"),
labels = my_labels
) +
# Facet the plot by the method, creating separate plots for MVA and SG
facet_grid(. ~ method, labeller = labeller(method = my_labels)) +
theme(legend.position = "top") + guides(color = guide_legend(title = NULL))
Example of smoothing a noisy spectrum.
With Figure @ref(fig:preproexample), which illustrates the effects of the different smoothing methods, we can clearly see that the pre-treated spectra are less noisy than the original spectrum. Figure @ref(fig:preproexample) shows three spectra: the original noisy spectrum (black line), the MVA-smoothed spectrum (cyan line), and the SG-smoothed spectrum (yellow line). The x-axis represents the wavelength (350 nm to 2500 nm), and the y-axis shows the reflectance values (approximately 0 to 0.4).
When applying smoothing techniques like moving average (MVA) and Savitzky-Golay (SG) filtering, selecting the appropriate window size and polynomial order is of critical importance. Large window sizes can lead to over-smoothing, flattening the signal and potentially removing valuable soil information. Conversely, smaller window sizes preserve more details but might not adequately reduce noise. Higher polynomial orders in SG filtering can capture more complex trends but may also fit the noise, making the filtering less effective. On the other hand, low polynomial orders combined with large window sizes might overly smooth the data, suppressing important spectral information.
To balance noise reduction and signal preservation, typical window sizes range from 5 to 15 nm, and polynomial orders from 2 to 6, with 3 being common for soil spectroscopy. Experimentation and cross-validation are often needed to determine the optimal parameters for your dataset.
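As a simple illustration of such experimentation, we can smooth the noisy spectrum created earlier with several window sizes and compare each result with the original (noise-free) spectrum; the window sizes below are arbitrary choices for demonstration:
# compare window sizes by the deviation from the original spectrum dat$spc[1, ]
window_sizes <- c(5, 11, 21, 51)
rmse <- sapply(window_sizes, function(w) {
  smoothed <- movav(my_noisy_spc, w = w)
  # movav drops (w - 1) / 2 bands at each edge, so match the bands by name
  sqrt(mean((smoothed - dat$spc[1, names(smoothed)])^2))
})
data.frame(window = window_sizes, rmse = rmse)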
After noise removal, another important pre-processing step is scatter correction, which aims to enhance the signal and correct baseline shifts that may occur due to physical characteristics of the soil, such as particle size distribution. Scatter correction techniques are designed to reduce variability between samples caused by scatter and adjust for baseline shifts. The most commonly used methods for scatter correction in NIR spectroscopy include standard normal variate (SNV) and multiplicative scatter correction (MSC).
Standard Normal Variate (SNV) is a simple and widely used method that works by centering each individual spectrum to zero and then dividing each spectral band value by the standard deviation of the entire spectrum. The SNV method processes each observation independently. The disadvantage of SNV is that it can be sensitive to noise.
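In formula form, denoting by \(x_{ij}\) the value of spectrum \(i\) at band \(j\), and by \(\bar{x}_{i}\) and \(s_{i}\) the mean and standard deviation of spectrum \(i\), the SNV transform is:
\[ x_{ij}^{\text{SNV}} = \frac{x_{ij} - \bar{x}_{i}}{s_{i}}. \]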
Multiplicative Scatter Correction (MSC) is also a widely used pre-processing technique which involves two main steps: estimating correction coefficients (additive and multiplicative contributions) and then applying these corrections to the spectrum or spectra to be corrected (Rinnan, Van Den Berg, and Engelsen 2009). The purpose of MSC is to remove undesirable scatter effects from the data matrix before further modeling. The typical approach uses the average spectrum of the calibration set as the reference spectrum, although a generic reference spectrum can also be employed. MSC can be extended to include higher-order polynomial fitting or incorporate a priori knowledge, but in most practical applications, the basic form of MSC is sufficient (Rinnan, Van Den Berg, and Engelsen 2009). This method effectively preserves spectral features while reducing background offsets and slopes.
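These two steps can be written as follows, where \(\bar{\mathbf{x}}\) is the reference (e.g., mean) spectrum, and \(a_{i}\) and \(b_{i}\) are the intercept and slope of a least-squares fit of spectrum \(\mathbf{x}_{i}\) against \(\bar{\mathbf{x}}\):
\[ \mathbf{x}_{i} = a_{i} + b_{i}\,\bar{\mathbf{x}} + \mathbf{e}_{i}, \qquad \mathbf{x}_{i}^{\text{MSC}} = \frac{\mathbf{x}_{i} - a_{i}}{b_{i}}. \]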
In many cases, the choice between MSC and SNV will depend on the specific characteristics of the data and the intended use of the processed spectra. Both techniques are highly effective at reducing scatter-induced variability, and their application is often preceded by smoothing methods like Savitzky-Golay filtering to ensure that the underlying trends in the data are preserved. In addition, because MSC uses a mean or some reference spectrum, it might not be so suitable if one is performing operations across disparate spectral libraries, in which case SNV would be preferred.
Implementation in R
For SNV:
# Assume 'dat$spc' is your spectral matrix, where each row is a spectrum
# Apply SNV to each spectrum
snv_spectra <- standardNormalVariate(dat$spc)
For MSC:
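A minimal sketch using the msc() function of the prospectr package (available in its recent versions); by default the mean spectrum of the input matrix is used as the reference:
# Apply MSC using the mean spectrum of dat$spc as the reference
msc_spectra <- msc(dat$spc, ref_spectrum = colMeans(dat$spc))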
Spectral derivatives are considered a standard tool in spectroscopic analysis. They are used to remove both additive and multiplicative effects in the spectra. They help enhance small spectral absorptions, resolve overlapping absorption features, and compensate for baseline shifts and instrument drift. However, derivatives can also amplify noise and make spectral interpretation more complex, so careful application is necessary. Here we describe the two basic methods used for computing derivatives: the Savitzky-Golay method and the Norris-Gap derivative, referred to hereafter as the Norris-Williams method.
The Savitzky-Golay (SG) method is widely used for calculating spectral derivatives. In this method, a polynomial is fitted to a subset of data points within a moving window centered around each point in the spectrum. The derivative of this (noise-free) polynomial is then computed at the center of the window. This method allows for the estimation of first or higher-order derivatives, preserving the integrity of the spectral signal.
To apply the Savitzky-Golay method in R we can use the savitzkyGolay() function from the prospectr package:
# Apply Savitzky-Golay smoothing and compute the first derivative
sg_first_derivative <- savitzkyGolay(
dat$spc, # The spectral data
m = 1, # The derivative order (first derivative)
p = 2, # The polynomial order (2 i.e fitting a quadratic polynomial)
w = 11 # The window size
)
# Apply Savitzky-Golay smoothing and compute the second derivative
sg_second_derivative <- savitzkyGolay(
dat$spc, # The spectral data
m = 2, # The derivative order (second derivative)
p = 2, # The polynomial order (2 i.e fitting a quadratic polynomial)
w = 11 # The window size
)
In the above code, p represents the polynomial order, m is the derivative order (with 1 for the first derivative and 2 for the second), and w is the window size. The advantage of the Savitzky-Golay method is that it smooths the data while computing derivatives.
The Gap-Segment method, also known as the Norris-Williams derivative method, is another technique used to calculate spectral derivatives while managing noise. It involves two steps: smoothing the spectra over a specified window and then calculating the derivative with a defined gap between points. The gap size helps to reduce noise further, making the method particularly useful for spectral data with high co-variation (Rinnan, Van Den Berg, and Engelsen 2009).
To apply the Gap-Segment derivative method in R we can use the gapDer() function from the prospectr package:
# Apply gapDer smoothing and compute the first derivative
gap_first_derivative <- gapDer(
dat$spc, # The spectral data
m = 1, # The derivative order (first derivative)
w = 11, # The window size
s = 3 # The segment size
)
# Apply gapDer smoothing and compute the second derivative
gap_second_derivative <- gapDer(
dat$spc, # The spectral data
m = 2, # The derivative order (second derivative)
w = 11, # The window size
s = 3 # The segment size
)
Now let’s plot the first and second derivatives of the first spectrum from our dataset:
library(ggplot2)
# Create the long-format data frame for the different derivatives
long_sg_1der_spc <- data.frame(
wavelength = as.numeric(colnames(sg_first_derivative)), # the wavelengths
spectral_value = sg_first_derivative[1, ], # Store the derivatives
method = "Savitzky-Golay derivative", # Label the data
derivative_order = "1st"
)
long_sg_2der_spc <- data.frame(
wavelength = as.numeric(colnames(sg_second_derivative)), # the wavelengths
spectral_value = sg_second_derivative[1, ], # Store the derivatives
method = "Savitzky-Golay derivative", # Label the data
derivative_order = "2nd"
)
long_gap_1der_spc <- data.frame(
wavelength = as.numeric(colnames(gap_first_derivative)), # the wavelengths
spectral_value = gap_first_derivative[1, ], # Store the derivatives
method = "Gap-Segment derivative", # Label the data
derivative_order = "1st"
)
long_gap_2der_spc <- data.frame(
wavelength = as.numeric(colnames(gap_second_derivative)), # the wavelengths
spectral_value = gap_second_derivative[1, ], # Store the derivatives
method = "Gap-Segment derivative", # Label the data
derivative_order = "2nd"
)
# Combine all derivatives into one data frame
derivatives_long <- rbind(
long_sg_1der_spc, long_sg_2der_spc, long_gap_1der_spc, long_gap_2der_spc
)
# Plot the derivatives using ggplot2 with facets for each method
ggplot(
derivatives_long, aes(x = wavelength, y = spectral_value, color = method)
) +
geom_line() +
labs(
x = "Wavelength (nm)",
y = "Derivative value"
) +
facet_wrap(derivative_order ~ method, scales = "free_y") +
theme(legend.position = "none") # Remove the legend as it's redundant with facets
Examples of the derivatives for the first spectrum in the dataset.
Figure @ref(fig:derplots) shows that while the Savitzky-Golay method captures detailed spectral features, in this particular case it tends to introduce more noise in the second derivative. The Gap-Segment method, on the other hand, provides a cleaner result with less noise in both the first and second derivatives, making it a better choice for applications requiring noise reduction in higher-order derivatives.
Resampling is a technique used to adjust the spectral data to new spectral band positions, either to increase or decrease the resolution or to align the data with a different set of wavelengths. The resample function in the prospectr package facilitates this process by using either spline or linear interpolation methods. We can specify the original band positions and the desired new band positions, with the function returning a resampled matrix or vector that aligns with the new spectral coordinates. This method is particularly useful when spectral data needs to be standardized or harmonized for comparison across different datasets or instruments.
For the rest of this tutorial we will resample the spectral data so that they have a resolution of 5 nm. This can be done with the resample function:
new_wavs <- seq(from = min(my_wavelengths), to = max(my_wavelengths), 5)
# resample and over-write the original spectral matrix
dat$spc <- resample(
dat$spc,
wav = my_wavelengths,
new.wav = new_wavs
)
Note that with the above code, the number of spectral variables is reduced from 2151 to 431.
Principal Component (PC) analysis is a statistical technique used for dimensionality reduction while preserving as much of the variability in the data as possible. It is particularly useful in processing and analyzing high-dimensional data such as soil spectroscopy data, where each sample might be represented by a large number of variables (i.e., spectral bands). By applying PC analysis, we reduce the spectral data to a few principal components that capture most of the variability in the spectra, facilitating the identification of key patterns related to soil properties. The PCs are used as the basis for a large number of applications, such as outlier detection and modeling.
Let us consider a spectral dataset represented by the matrix \(\mathbf{X}\) of size \(n \times p\), where \(n\) is the number of samples (spectra) and \(p\) is the number of spectral bands. PC analysis aims to find a transformation that converts the original spectra dataset into a new set of variables, called principal components, which are linear combinations of the original variables. The transformation can be represented as:
\[\mathbf{Z} = \mathbf{X} \mathbf{P},\]
where \(\mathbf{Z}\) is the matrix of scores (coordinates of the original data in the new principal component space), \(\mathbf{X}\) is the centered (i.e., mean-subtracted) data matrix of spectra, and \(\mathbf{P}\) is the matrix of loadings (eigenvectors of the covariance matrix of \(\mathbf{X}\)). The columns of \(\mathbf{P}\) are the principal component loadings ordered by their corresponding eigenvalues in descending order. The first principal component is the direction that maximizes the variance of the data, the second principal component is the direction orthogonal to the first that maximizes the remaining variance, and so on. The last several components usually represent noise in the data and account for a very small portion of the explained variance.
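The following toy example (illustrative only) verifies this relation with the prcomp() function used later in this section: the scores returned by prcomp() equal the centred data multiplied by the loadings.
# verify Z = XP on a small random matrix
X <- matrix(rnorm(50), nrow = 10)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
# centre X manually and project it onto the loadings
Z <- sweep(X, 2, colMeans(X)) %*% pc$rotation
all.equal(Z, pc$x) # TRUE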
Loadings: In PC analysis, loadings represent the weights assigned to each original variable in the calculation of the principal components. They indicate how much each original variable contributes to its corresponding principal component. Loadings can reveal patterns and relationships among the original variables.
Scores: The scores are the transformed coordinates of the original data in the principal component space. The scores indicate the position of each observation in the new space defined by the principal components. They can be used to identify similarities and differences among observations, as well as potential outliers.
To perform PC analysis on soil spectral data in R and visualize the results using the ggplot2 library, one needs to follow these steps:
Here we perform mean centering and, if necessary, apply variance scaling to the spectra.
Mean centering involves subtracting the mean spectrum of the entire dataset from each individual spectrum. This step ensures that PC analysis identifies the variance of the spectra relative to their mean, focusing on differences rather than absolute values. Without centering, PC analysis might incorrectly interpret the overall mean as a significant source of variation.
Variance scaling (or standardizing) ensures that each spectrum has unit variance by dividing the values of each spectrum by its standard deviation. This step is necessary to ensure that PCA captures variation due to differences in the spectral patterns rather than differences in scale. If the spectra are not scaled, variables with larger scales could dominate the variance captured by PCA, potentially leading to misleading results.
By centering and scaling the spectra, PC analysis can effectively reveal the underlying structure of the data, focusing on the most significant sources of variation and making it easier to interpret the results.
Here we apply PC analysis using, for example, the prcomp() function from base R.
Here, we create visualizations to interpret the results of the PC analysis. Using the ggplot2 library, we can plot the scores to explore the relationships and clustering among the samples in the new principal component space. This helps in identifying patterns, such as grouping of similar samples, trends, or potential outliers. Additionally, we can plot the loadings to understand how the original spectral variables contribute to the principal components. This helps in identifying which wavelengths are most influential in explaining the variability within the dataset.
# Perform PC analysis on the spectra processed with scatter correction (SNV)
# and a Savitzky-Golay first derivative
# The combination of SNV and the Savitzky-Golay derivative helps to remove
# scatter effects and enhance the signal, preventing artifacts from dominating
# the principal components.
# This ensures that the PC analysis focuses on the underlying spectral patterns
# relevant to soil attributes.
pca_result <- prcomp(
savitzkyGolay(standardNormalVariate(log10(1 / dat$spc)), m = 1, p = 1, w = 3),
center = TRUE, scale. = TRUE
)
Note that in the above code we apply the SNV transformation and the Savitzky-Golay derivative before the PC analysis. This helps mitigate the impact of future unseen baseline shifts and scattering effects, which can arise after changes in the optical components of the spectrometer or significant alterations in scanning conditions. These preprocessing steps ensure that PC analysis focuses on the true spectral patterns, making the resulting models more robust to such variations in future measurements.
The code above performs PC analysis of the pretreated spectra of dat, scaling the data before the analysis, and stores the result in pca_result.
# Plotting the scores of the first two principal components
scores <- as.data.frame(pca_result$x)
ggplot(scores, aes(x = PC1, y = PC2)) +
geom_point(alpha = 0.5) +
labs(x = "PC 1", y = "PC 2")Scatter plot of the scores of the first two principal components.
The previous code creates a scatter plot of the first two principal components (PC1 and PC2). In Figure @ref(fig:myscores) each point represents a spectrum from the original dataset, with coordinates determined by the scores of the first two principal components. PC1, plotted on the x-axis, captures the largest variance in the data, while PC2, on the y-axis, captures the second largest variance. The spread of points along these axes indicates the variability among observations, with some distant points that can potentially be considered as outliers.
# Extract the loadings
loadings <- pca_result$rotation
# Create a long-format data frame with the loadings of all components
loadings_df_long <- data.frame(
component = rep(paste0("PC", 1:ncol(loadings)), each = nrow(loadings)),
wavelengths = rep(as.numeric(rownames(loadings)), ncol(loadings)),
value = as.vector(loadings)
)
# Plotting the first three loadings
ggplot(
loadings_df_long[loadings_df_long$component %in% c("PC1", "PC2", "PC3"), ],
aes(x = wavelengths, y = value, color = component, group = component)
) +
geom_line() +
labs(x = "Wavelength (nm)", y = "Loading value") +
scale_color_manual(
values = c("PC1" = "#00AFBB", "PC2" = "#E7B800", "PC3" = "#FC4E07")
) +
theme(legend.position = "top") + guides(color = guide_legend(title = NULL))
First three loading spectra resulting from our PC analysis.
The above code creates a plot of the first three PC loadings. The colours for the loadings are manually set. Figure @ref(fig:loadingsplot) shows how each wavelength contributes to the first three principal components (PC1, PC2, and PC3). The x-axis represents the wavelengths, while the y-axis shows the loading values, reflecting the strength and direction of the contribution of each wavelength. The plot includes three lines, each representing one of the principal components, with blue for PC1, yellow for PC2, and red for PC3. This visualization helps in understanding the influence of each spectral band on the principal components, elucidating the underlying structure and relationships within the data.
The number of components to use for any subsequent analysis can be determined using a scree plot. A scree plot displays the eigenvalues, or the amount of variance explained by each principal component, on the y-axis, with the components ordered by their eigenvalues on the x-axis. Typically, the plot reveals a distinct point where the magnitude of the eigenvalues drops off, commonly referred to as the “elbow.” This elbow point is an indicator of the optimal number of components to retain, as the components to the left of the elbow explain a significant portion of the variance, while those to the right contribute less and are often considered noise. The rationale behind this method is to select a subset of components that capture most of the spectral data variation while avoiding the inclusion of components that provide minimal information.
When creating a scree plot in R, it is possible to calculate the variance explained by each principal component and plot it against the number of components. In this example, we visualize only the first 20 components in the scree plot.
# Calculate variance explained by each component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Create a data frame for plotting (keep the first 20 components only)
components <- data.frame(Component = 1:20, Variance = var_explained[1:20])
# Create the scree plot
ggplot(components, aes(x = Component, y = Variance)) +
geom_line() + geom_point() +
scale_x_continuous(breaks = 1:nrow(components)) +
labs(x = "PC", y = "Explained spectral variance") +
theme_minimal()
Scree plot for our PC analysis.
Figure @ref(fig:screeplot) suggests that the first component is the most important: it explains about 47% of the variance, followed by the second and third components, which also contribute to explaining the variance of the original spectral dataset but to a smaller extent. After the tenth component, the additional components contribute very little to explaining the variance. Based on this scree plot, one might decide to retain the first nine or ten principal components for further analysis; together they explain about 96% of the original variance.
Typically, when PC analysis is applied to raw spectra, a large portion of the original spectral variance is captured in the first few principal components, often just the first three. This may mask “hidden” spectral features that have not yet been revealed: if the spectra are rich in information, it is unlikely that the full variation is captured in just three components. Therefore, signal-enhancing pre-processing usually results in a larger number of components that capture the complexity and richness of the spectral data.
A biplot is a summary figure that can be used to visualize the results of PC analysis. It simultaneously displays information on both the scores and the loadings. A biplot helps to visualize how strongly each wavelength (or wavenumber) influences the principal components and how observations are distributed in the principal component scores space. In a biplot, each point represents a single spectrum from the dataset. The distribution of the points in the biplot indicates the similarity among the spectra. The vectors represent the loadings, where the direction of a vector indicates the contribution of a wavelength to the principal component, and the length of the vector reflects the strength of this contribution.
# Adding row names as a column for plotting
loadings <- as.data.frame(loadings)
scores <- as.data.frame(scores)
loadings$variable <- rownames(loadings)
scores$observation <- rownames(scores)
# Filter loadings_df to include only selected variables
selected_variables <- c("400", "1250", "2300")
filtered_loadings_df <- loadings[loadings$variable %in% selected_variables, ]
# Plotting observations
gg <- ggplot() +
geom_point(
data = scores, aes(x = PC1, y = PC2, color = "observations"), alpha = 0.5
) +
scale_color_manual(
values = c("observations" = "blue", "variables" = "red")
) +
theme_minimal()
# Adjusting arrow length
gg <- gg + geom_segment(
data = filtered_loadings_df,
aes(x = 0, y = 0, xend = PC1 * 1000, yend = PC2 * 1000, color = "variables"),
arrow = arrow(type = "open", length = unit(0.2, "inches"))
)
# Adding selected variable labels
gg <- gg + geom_text(
data = filtered_loadings_df,
aes(x = PC1 * 1000, y = PC2 * 1000, label = variable),
hjust = 1.2, vjust = 1.2
)
# Increase the limits of the axes
gg <- gg + xlim(-75, 75) + ylim(-75, 75) +
theme(legend.position = "top") + guides(color = guide_legend(title = NULL))
gg
Biplot of the first two principal components with selected variable loadings.
The biplot (Figure @ref(fig:biplot)) shows how individual spectra are distributed across the first two principal components (PC1 and PC2). The wide spread of points along the PC1 axis indicates that PC1 captures a substantial portion of the variance in the original data. Spectra that are close together in the plot are more similar to each other, while those further apart show greater differences in their spectral characteristics.
The red vectors represent specific wavelengths (400, 1250, and 2300 nm), showing how these wavelengths contribute to the principal components. The direction of each vector indicates the influence of that wavelength on the principal components, while the length of the vector reflects the strength of that influence. Spectra that are positioned close to a vector (e.g., near the 2300 nm vector) likely have stronger absorption features associated with that particular wavelength.
Given that spectroscopic measurements are generally fast and cost-effective, it is feasible to scan all available soil samples across the visible, NIR, and MIR regions, or combinations thereof. However, when calibrating spectroscopic models, both the spectra and reference values for soil attributes obtained through conventional methods are required. Due to the higher costs and longer time needed for conventional analyses, it is important to carefully select a representative subset of soil samples, referred to as the calibration set, to be sent to the laboratory (Wadoux et al. 2021).
The size and selection method of this calibration set are critical to the quality of the resulting model that relates soil attributes to spectra (Ramirez-Lopez et al. 2014). Several sampling designs can be used, including simple random sampling and more sophisticated methods like k-means clustering, which involves creating unsupervised clusters within the spectral space and sampling from each cluster. Another effective approach is conditioned Latin hypercube sampling (Budiman Minasny and McBratney 2006), which aims to uniformly cover the empirical distribution of the spectra while maintaining the correlation structure among spectral variables, typically represented by principal component scores.
In all sampling design approaches, the spectra are first summarized into a few components using dimensionality reduction techniques such as PC analysis. The sampling design is then applied to the first few principal component scores. This way, the spectral dataset, which is often high-dimensional (many spectra and thousands of wavelengths), is summarized into a lower-dimensional space (see also the previous section).
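As a brief illustration of the k-means approach mentioned above, the following sketch (our own minimal example; the choice of 30 samples and five components is illustrative) retains, for each cluster, the sample closest to the cluster centre in the PC scores space.
# k-means based selection: one cluster per sample to be selected
set.seed(801124)
km <- kmeans(pca_result$x[, 1:5], centers = 30, nstart = 10)
# For each cluster, keep the observation closest to the cluster centre
km_selected <- sapply(1:30, function(i) {
  members <- which(km$cluster == i)
  d2 <- colSums((t(pca_result$x[members, 1:5, drop = FALSE]) - km$centers[i, ])^2)
  members[which.min(d2)]
})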
The number of samples to be sent to the laboratory should, ideally, be as small as possible to save costs, but without compromising the accuracy of the subsequent chemometric models. As a simple rule, the representativity of a subsample for a given sampling design can be assessed by computing the difference between the probability density function of the subsample and that of the entire set of available spectra. Examples in the literature include Ramirez-Lopez et al. (2014) and Nawar and Mouazen (2018). Here, we do not provide examples of such approaches because the number of samples to be sent to the laboratory is usually guided by the cost of the analysis.
Kennard-Stone sampling (Kennard and Stone 1969) is a method used to select a subset of samples from a larger dataset in such a way that the selected samples cover the variability of the entire dataset as much as possible. The Kennard-Stone algorithm works by first selecting the two samples that are the most distant from each other in the dataset based on Euclidean distance. Subsequent samples are selected iteratively, each time choosing the sample that has the maximum minimum distance to all previously selected samples. This process continues until the desired number of samples has been selected.
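Before turning to the packaged implementation, the short sketch below (our own illustration; 30 samples and ten components are placeholder values) spells out this maximin rule.
# Manual Kennard-Stone selection on the PC scores (illustration only)
d <- as.matrix(dist(pca_result$x[, 1:10])) # pairwise Euclidean distances
sel <- as.vector(which(d == max(d), arr.ind = TRUE)[1, ]) # two most distant samples
while (length(sel) < 30) {
  remaining <- setdiff(seq_len(nrow(d)), sel)
  # distance from each remaining sample to its closest selected sample
  min_d <- apply(d[remaining, sel, drop = FALSE], 1, min)
  sel <- c(sel, remaining[which.max(min_d)]) # add the maximin sample
}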
In R this can be performed using the kenStone() function of the prospectr package. Recall that this is done in the PC scores space. We assume that we want to select 30 samples for laboratory analysis.
# the target subset size to be selected
nn <- 30
# the number of PCs to use
n_pcs <- 10
# select the samples with Kennard Stone
selected <- kenStone(
pca_result$x[, 1:n_pcs], # the input is the matrix of PC scores
k = nn, # number of samples to select
metric = 'mahal', # the distance metric
.center = TRUE,
.scale = FALSE
)
# show the results; here we are only interested in the element named 'model'
str(selected)
## List of 2
## $ model: int [1:30] 203 102 289 39 77 103 281 99 275 227 ...
## $ test : int [1:270] 1 3 4 5 6 7 8 9 10 11 ...
We proceed to the next step: plotting the selected samples alongside the full dataset using ggplot2 to visualize the results in 2D.
scores$kss_selected <- FALSE
scores$kss_selected[selected$model] <- TRUE
# Plot using ggplot2
ggplot(
scores[!scores$kss_selected, ],
aes(x = PC1, y = PC2, color = kss_selected, shape = kss_selected)
) +
geom_point(size = 2, alpha = 0.7) +
geom_point(size = 3, data = scores[scores$kss_selected, ]) +
theme_minimal() +
labs(x = "PC1",
y = "PC2",
color = "",
shape = "") +
scale_color_manual(
values = c("TRUE" = "red", "FALSE" = "black"),
labels = c("TRUE" = "Selected samples", "FALSE" = "Not selected samples")
) +
scale_shape_manual(
values = c("TRUE" = 16, "FALSE" = 16),
labels = c("TRUE" = "Selected samples", "FALSE" = "Not selected samples")
) +
theme(legend.position = "top")
Kennard-Stone sampling (on the principal component space) for selecting the calibration subset.
The provided code snippet uses the ggplot2 library to create a scatter plot of the results from the PC analysis. The code first flags the samples selected by the Kennard-Stone algorithm in the scores data frame and then splits this data frame into two subsets: one for the non-selected points and another for the selected points. The scores data frame contains columns for the first and second principal components (PC1 and PC2) and a logical column indicating whether each point is selected.
The plot (Figure @ref(fig:kssampling)) is generated by first plotting the non-selected points with a smaller size and reduced opacity, and then overlaying the selected points with larger, more prominent markers. The non-selected points are displayed as black solid circles, while the selected points are shown as larger red circles.
The resulting image from this code presents a scatter plot in which the majority of the data points (not selected) are displayed in black, with the selected points highlighted in red. This visualization effectively shows the distribution of the selected samples within the principal component space, emphasizing their placement relative to the entire dataset, which is particularly useful for understanding the selection made by the Kennard-Stone algorithm during the PC analysis.
Conditioned Latin hypercube sampling (cLHS, Budiman Minasny and McBratney (2006)) is a sampling design that aims to cover the multidimensional space of the spectral data. This is a similar objective to Kennard-Stone, but it is achieved differently. In cLHS, the variables (here, the PC scores) are divided into intervals, called strata. The number of strata is equal to the number of samples to be selected. This division is done for all variables (i.e., scores). In cLHS, the objective is to sample within each stratum. An optimization algorithm with an objective function is used to achieve this goal.
The objective function in cLHS has two main goals. The first goal is to ensure that all strata are covered by a sampling point in the multidimensional space. The second goal is to account for the correlation between variables. To optimize these criteria, simulated annealing is commonly employed, although many other optimization algorithms are available. Simulated annealing works by testing various sampling solutions and retaining the ones that provide the best value for the objective function. This iterative process continues until a predefined convergence criterion is met, such as reaching a specific number of iterations.
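To make the notion of strata concrete, the short sketch below (our own illustration) cuts the empirical distribution of the first principal component into as many equally probable intervals as there are samples to select.
# Boundaries of the cLHS strata for PC1: nn equally probable intervals
pc1_strata <- quantile(scores$PC1, probs = seq(0, 1, length.out = nn + 1))
head(round(pc1_strata, 2))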
cLHS is a popular sampling design in spectroscopy and is also widely used in environmental sciences. We can implement cLHS on the spectroscopic data using the clhs package (Roudier et al. 2019).
library(clhs) # Load the cLHS package for Conditioned Latin Hypercube Sampling
# Set the seed for reproducibility of the random sampling process
set.seed(801124)
# Perform Conditioned Latin Hypercube Sampling (cLHS)
selected_clhs <- clhs(
x = scores[, 1:n_pcs], # Use the first n_pcs principal components
size = nn, # Specify the number of samples to be selected (nn)
iter = 1000, # Set the number of iterations for simulated annealing
simple = FALSE # Return the full cLHS object rather than only the sampled indices
)
# Generate plots to visualize the cLHS results,
# typically including diagnostics like the objective
# function's progress over iterations
plot(selected_clhs)
cLHS sampling (on the principal component space) for selecting the calibration subset.
Figure @ref(fig:clhsampling) illustrates the changes in the objective function values throughout the iterations of the simulated annealing algorithm. This plot provides a visual assessment of how effectively the objective function was minimized during the optimization process. Initially, there is a sharp decrease in the objective function value, which gradually levels off as the number of iterations increases. This pattern suggests that the algorithm made significant progress early on, with diminishing improvements as it approached an optimal or near-optimal solution. Such a figure is commonly used to monitor the convergence of an optimization algorithm, showing how well the algorithm minimizes (or maximizes) the objective function over time.
scores$clhs_selected <- FALSE
scores$clhs_selected[selected_clhs$index_samples] <- TRUE
# Plot using ggplot2
ggplot(
scores[!scores$clhs_selected, ],
aes(x = PC1, y = PC2, color = clhs_selected, shape = clhs_selected)
) +
geom_point(size = 2, alpha = 0.7) +
geom_point(size = 3, data = scores[scores$clhs_selected, ]) +
theme_minimal() +
labs(x = "PC1",
y = "PC2",
color = "",
shape = "") +
scale_color_manual(
values = c("TRUE" = "red", "FALSE" = "black"),
labels = c("TRUE" = "Selected samples", "FALSE" = "Not selected samples")
) +
scale_shape_manual(
values = c("TRUE" = 16, "FALSE" = 16),
labels = c("TRUE" = "Selected samples", "FALSE" = "Not selected samples")
) +
theme(legend.position = "top")
Conditioned Latin hypercube sampling (on the principal component space) for selecting the calibration subset.
The provided code uses ggplot2 to create a scatter plot of principal component analysis results, highlighting samples selected using Conditioned Latin Hypercube Sampling (cLHS). The selected_clhs object stores indices of selected samples, which are then marked in the scores data frame. In the plot (Figure @ref(fig:clhsampling2)), selected samples are shown as larger red circles, while non-selected samples appear as smaller black circles. This visualization demonstrates how cLHS effectively selects a representative subset within the PCA space, covering the spectral variation of the dataset.
Throughout this document, we primarily use the organic carbon content as the main response variable and the derivative of the scatter-corrected absorbance spectra as the predictor variables.
Partial Least Squares Regression (Wold, Sjöström, and Eriksson 2001) (PLSR) is a statistical method used to model linear relationships between input variables and responses. It is particularly useful when the predictors are highly collinear or when the number of predictors exceeds the number of observations, as is often the case in soil spectroscopy. PLSR identifies relationships between two matrices (denoted as \(\mathbf{X}\) and \(\mathbf{Y}\)), such that the covariance between the dependent variables in \(\mathbf{Y}\) and the independent variables in \(\mathbf{X}\) is maximized. This is achieved through a dimensionality reduction technique similar to PCA, but while PCA focuses on explaining the variance in \(\mathbf{X}\) only, PLSR focuses on maximizing the covariance between \(\mathbf{X}\) and \(\mathbf{Y}\).
In statistical terms, let the decomposition of the \(\mathbf{X}\) and \(\mathbf{Y}\) matrices be expressed as:
\[ \mathbf{X} = \mathbf{T}\mathbf{P}^\top + \mathbf{E}, \]
\[ \mathbf{Y} = \mathbf{T}\mathbf{C}^\top + \mathbf{F}, \] where \(\mathbf{X}\) is the spectral data matrix, and \(\mathbf{Y}\) is the matrix of response variables. The matrix \(\mathbf{T}\) contains the scores (latent variables) for \(\mathbf{X}\), \(\mathbf{P}\) contains the loadings for \(\mathbf{X}\), \(\mathbf{C}\) contains the loadings for \(\mathbf{Y}\), and \(\mathbf{E}\) and \(\mathbf{F}\) are the residual matrices for \(\mathbf{X}\) and \(\mathbf{Y}\), respectively.
The scores matrix \(\mathbf{T}\) is calculated to maximize the covariance with the responses \(\mathbf{Y}\). This is achieved by regressing \(\mathbf{Y}\) on \(\mathbf{T}\) and using the regression coefficients \(\mathbf{B}\) to compute the predicted values of \(\mathbf{Y}\). The relationship can be written as: \[ \mathbf{\hat{Y}} = \mathbf{T}\mathbf{B}\mathbf{C}^\top, \]
where \(\mathbf{B}\) is obtained during the iterative process of maximizing the covariance between \(\mathbf{T}\) and \(\mathbf{Y}\). Once the regression coefficients are obtained, predictions can be made using this equation.
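To build some intuition, the following sketch (our own illustration, not code from the pls package; it assumes the matrix of pre-processed spectra x and the response y used throughout this document) computes the first PLS weight and score vectors: the weights are proportional to the covariance between each centred predictor and the centred response.
# First PLS latent variable by hand (illustration only)
x_c <- scale(x, center = TRUE, scale = FALSE) # centred spectra
y_c <- y - mean(y) # centred response
w <- crossprod(x_c, y_c) # covariance of each wavelength with y
w <- w / sqrt(sum(w^2)) # normalize the weight vector
t1 <- x_c %*% w # scores of the first latent variable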
PLSR is perhaps the most widely used quantitative method in spectroscopy in general. It is used to correlate the intensity at various wavelengths with the concentrations of different chemicals or physical soil properties. Beyond quantitative measures, PLSR can also assist in qualitative analysis, such as identifying the presence of specific compounds or determining the quality of a sample based on its spectral signature. Below, we demonstrate an example of PLSR for quantitative analysis using R.
We will use the pls package (Wehrens and Mevik 2007), which is designed for partial least squares regression. If it is not already installed, we will need to install it first. We extract the Organic_Carbon as the response variable and the wavelength variables as our predictors. We need to ensure that these predictors are numeric and have no missing values. Using the pls package, we fit a PLS regression model. The number of components to use can be determined based on cross-validation, similar to PC analysis but this time based on internal cross-validation results. For simplicity, we use the functions provided by the pls package, but later in this document, we also explain how to obtain validation statistics and perform cross-validation on your own.
# Load necessary libraries
library(pls)
# Fit the PLSR model
plsr_model <- plsr(y ~ x, ncomp = 20, validation = 'CV')
# Summary of the model
summary(plsr_model)
## Data: X dimension: 300 429
## Y dimension: 300 1
## Fit method: kernelpls
## Number of components considered: 20
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.121 1.091 0.9829 0.9214 0.8863 0.8792 0.8574
## adjCV 1.121 1.092 0.9820 0.9172 0.8843 0.8726 0.8550
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 0.8526 0.8347 0.8289 0.8125 0.7990 0.8079 0.8042
## adjCV 0.8478 0.8302 0.8236 0.8061 0.7914 0.7972 0.7957
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 0.7992 0.7992 0.8021 0.8212 0.8084 0.8103 0.7986
## adjCV 0.7912 0.7897 0.7920 0.8089 0.7968 0.7987 0.7874
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 73.596 84.73 86.35 89.18 90.53 93.56 94.74 96.14
## y 5.985 26.26 39.71 43.26 47.98 49.21 53.80 57.29
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.85 97.26 97.47 97.80 98.28 98.57 98.75
## y 59.76 62.25 64.33 65.35 66.16 67.36 69.00
## 16 comps 17 comps 18 comps 19 comps 20 comps
## X 98.99 99.10 99.23 99.32 99.40
## y 69.72 70.92 71.64 72.31 73.26
The results show that using around 10 to 12 components is optimal for this dataset, as this range provides a low cross-validation error and a reasonable explanation of the variance in Organic Carbon. Beyond this point, increasing the number of components does not significantly improve the performance of the model and may lead to overfitting, as indicated by the increase in Root Mean Square Error of Prediction (RMSEP) values.
This analysis suggests that a PLSR model with around 10 components effectively captures the relationship between the wavelengths and Organic Carbon, making it a suitable model for prediction within this dataset.
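Note that the pls package also provides the selectNcomp() function, which suggests a parsimonious number of components directly from the cross-validation results; its suggestion may differ slightly from a visual choice.
# Automated component selection using the one-sigma heuristic
selectNcomp(plsr_model, method = "onesigma")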
# Plotting
plot(
plsr_model,
plottype = "validation", main = NA,
xlab = "PLSR component", ylab = "Organic Carbon RMSEP (%)"
)
grid()
Validation plot for the PLSR model.
Figure @ref(fig:plsrmsep) shows the RMSEP on the y-axis against the number of components on the x-axis. This graph is typically used to determine the optimal number of components in a model by identifying the point where the RMSEP is minimized before it starts to increase, indicating overfitting. Here it confirms that the optimal number of components for PLSR is around 10.
Finally, we evaluate the model using cross-validation and plot the results. This includes looking at validation statistics to assess the model performance. We can create plots to visualize i) the regression coefficients to see which wavelengths are most important and ii) the predicted vs. actual values to assess the model fit.
# Indicate the optimal number of PLSR components based on prior analysis
optimal_ncomp <- 10
# Extract the regression coefficients for the optimal number of components
selected_coeff <- plsr_model$coefficients[, , optimal_ncomp]
# Set up the plotting area to display two plots side by side
par(mfrow = c(1, 2))
# Plot 1: Regression coefficients against wavelengths
plot(
x = as.numeric(names(selected_coeff)),
y = selected_coeff,
type = "l",
xlab = "Wavelengths (nm)",
ylab = "Regression coefficients",
main = "Regression coefficients"
)
grid()
# Predict the response variable using the optimal number of components
y_hat <- predict(plsr_model, ncomp = optimal_ncomp)
# Plot 2: Predicted vs. Measured values
plot(
y_hat, y,
xlab = 'Predicted Organic Carbon (%)',
ylab = 'Measured Organic Carbon (%)',
xlim = range(y_hat, y),
ylim = range(y_hat, y),
pch = 16,
col = rgb(0.5, 0.5, 0.5, 0.7),
main = "Predicted vs measured"
)
abline(0, 1, col = "red") # Add a 1:1 perfect fit line
grid()
par(mfrow = c(1, 1))
Left: Regression coefficients across wavelengths for the optimal 10 PLSR components. Right: Predicted vs. measured organic carbon values, with the red line indicating the ideal 1:1 fit.
Figure @ref(fig:plscoeffval) (left) illustrates the PLSR regression coefficients across different wavelengths. The x-axis represents wavelengths ranging from 355 to 2495 nm, while the y-axis shows the corresponding regression coefficients, spanning from approximately -8 to 11. The fluctuations in the coefficients indicate the varying influence of different spectral regions on the predictions made by the model, with certain wavelengths playing a more pronounced role.
Figure @ref(fig:plscoeffval) (right) is a scatter plot comparing the predicted organic carbon values against the measured values, based on the PLSR model with 10 components. The x-axis displays the predicted organic carbon percentage, while the y-axis represents the measured values. The points are generally aligned along the red 1:1 line, indicating that the model predictions are fairly accurate. However, some deviations from this line suggest prediction errors. This plot is essential for assessing the accuracy of the PLSR model in predicting soil organic carbon content.
Cubist (Quinlan et al. 1992) is a machine learning algorithm that is particularly popular in the field of soil spectroscopy for predictive modeling (Budiman Minasny and McBratney 2008). It combines aspects of decision trees with linear regression models to create rule-based ensemble models that are both interpretable and accurate.
The foundation of Cubist models consists of decision trees. However, unlike conventional decision trees that have terminal nodes with constant values, Cubist models employ linear regression models at these nodes. This approach allows the model to capture both non-linear relationships between variables through the tree structure and linear relationships within the subsets of data defined by the tree rules.
The key features of Cubist include:
it is a rule-based model that generates rules to split the dataset into subsets, with each rule corresponding to conditions based on the input features;
at each terminal node of these rules, a linear model is fitted to the subset of data that meets the rule conditions;
multiple models, also called committees, can be built and averaged to improve prediction accuracy;
Cubist can handle both continuous and categorical data, making it versatile for various applications.
In R, Cubist models can be trained using the Cubist package (Kuhn et al. 2012).
# Load the packages
library(Cubist)
# Fit a Cubist model with default settings
cubist_fit <- cubist(y = y, x = x)
# Print the model summary
cubist_fit
##
## Call:
## cubist.default(x = x, y = y)
##
## Number of samples: 300
## Number of predictors: 429
##
## Number of committees: 1
## Number of rules: 2
The output of the previous code describes a Cubist model fitted to a dataset with 300 samples and 429 predictors, targeting the prediction of Organic_Carbon. This model configuration uses a single committee, meaning it consists of one ensemble model, and it has generated two rules. These rules are derived from the predictors to segment the data into subsets, within which linear regression models are applied to predict the Organic_Carbon content. The fact that only two rules were generated suggests that the relationship within the data might be relatively simple and straightforward.
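The complete set of rules, together with the linear model fitted within each rule, can be printed with the summary() method, which produced the output below.
# Print the full Cubist model: rule conditions, linear models and diagnostics
summary(cubist_fit)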
##
## Call:
## cubist.default(x = x, y = y)
##
##
## Cubist [Release 2.07 GPL Edition] Wed Jan 22 22:38:57 2025
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 300 cases (430 attributes) from undefined.data
##
## Model:
##
## Rule 1: [157 cases, mean 1.4381278, range 0.07154407 to 3.734808, est err 0.4129648]
##
## if
## 2260 <= -0.001437362
## then
## outcome = 1.28288 - 160.7 1375 + 108.1 2370 + 535 1095 + 51.7 2470
## - 493 1305 + 190 2260 + 563 1295 + 40.9 2015 - 392 1060
## + 135 2190 + 114 2125 + 29.3 2455 - 5.47 1885 - 126 2165
## - 90 2185 + 25.1 1415 + 102 2170 - 122 745 + 29.6 2335
## + 119 750 + 52 1560 - 25.2 2355 - 17 2485 - 125 1065
## - 142 1255 + 100 1300 + 10.8 1850 - 20 670 + 5 2175
##
## Rule 2: [143 cases, mean 2.4931519, range 0.1257046 to 4.915708, est err 0.6526790]
##
## if
## 2260 > -0.001437362
## then
## outcome = 0.3225504 + 133.3 2015 + 207.2 1435 - 227.5 1440 - 591 745
## + 575 750 + 114.4 2370 + 64.4 2455 - 12.01 1885 - 197 2185
## + 55.2 1415 + 89.5 2335 - 224 2170 + 213 2175 + 122 2190
## + 454 1295 - 57.3 1375 + 113 1560 - 55.3 2355 + 309 1095
## - 37.3 2485 - 275 1065 - 329 1300 - 76 2125 + 32.9 2380
## + 113 2165 - 312 1255 - 206 1140 + 230 1130 + 76 2260
## + 23.7 1850 - 40 860 - 44 670
##
##
## Evaluation on training data (300 cases):
##
## Average |error| 0.5737911
## Relative |error| 0.63
## Correlation coefficient 0.76
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% 2260
## 100% 670
## 100% 745
## 100% 750
## 100% 1065
## 100% 1095
## 100% 1255
## 100% 1295
## 100% 1300
## 100% 1375
## 100% 1415
## 100% 1560
## 100% 1850
## 100% 1885
## 100% 2015
## 100% 2125
## 100% 2165
## 100% 2170
## 100% 2175
## 100% 2185
## 100% 2190
## 100% 2335
## 100% 2355
## 100% 2370
## 100% 2455
## 100% 2485
## 52% 1060
## 52% 1305
## 52% 2470
## 48% 860
## 48% 1130
## 48% 1140
## 48% 1435
## 48% 1440
## 48% 2380
##
##
## Time: 0.1 secs
The output describes a Cubist model trained on a dataset with 300 samples and 429 predictors, targeting the prediction of an outcome variable. The model comprises two rules:
Rule 1 applies to 157 cases and predicts the outcome using a linear combination of various spectral attributes. The mean outcome for this rule is approximately 1.44, with a range from 0.07 to 3.73, and an estimated error of about 0.41.
Rule 2 applies to 143 cases and also uses a linear combination of attributes for prediction, with a mean outcome of approximately 2.49, ranging from 0.13 to 4.92, and an estimated error of about 0.65.
The evaluation on the training data shows an average absolute error of about 0.57, a relative error of 0.63, and a correlation coefficient of 0.76, indicating a moderate linear relationship between the predicted and actual outcomes. Most of the listed attributes are used in both rules (100% usage), while a few appear in only about half of the cases (48-52%), indicating their varying roles in the model construction.
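As a sketch of the committee mechanism described earlier (the parameter values here are illustrative, not tuned), several committee models can be combined, and predictions can additionally be corrected using the most similar training instances.
# Fit a Cubist model with 5 committees (an ensemble of rule-based models)
cubist_fit_committees <- cubist(x = x, y = y, committees = 5)
# Predict with an instance-based correction using the 5 nearest neighbours
cubist_y_hat <- predict(cubist_fit_committees, x, neighbors = 5)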
Random Forest (RF) (Breiman 2001) is an ensemble machine learning method used extensively in soil spectroscopy for predicting soil attributes from spectral data. In RF, multiple decision trees are built during calibration. The final output is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. Individual trees in an RF are built using a technique called bootstrap aggregating (Breiman 1996), or bagging. Each tree is constructed from a different bootstrap sample of the data. A bootstrap sample is a random sample of the dataset with replacement, meaning the same data point can appear multiple times in the sample. At each node of the tree, a random subset of features is selected, and the best split on these features is used to split the node. This process introduces randomness into the model, which helps reduce variance and prevents overfitting.
The RF algorithm has three main parameters: Mtry, the number of trees, and the minimum node size; a short tuning sketch follows their descriptions below.
Mtry determines the number of features to consider when looking for the best split at each node. Depending on the setting, it can be a fixed number of features, a percentage of the total features, or other variations like the square root or log of the total number of features. The choice of Mtry affects the diversity of the trees in the forest; lower values increase the diversity among the trees, helping to reduce model variance.
Number of trees defines the number of trees in the forest. Generally, more trees increase the performance and make the predictions more stable, but they also increase the computational cost. Typically, there is a point of diminishing returns where increasing the number of trees no longer improves the model significantly.
Minimum node size specifies the minimum number of samples required to split an internal node. Higher values prevent the model from learning overly specific patterns, thus lowering the risk of overfitting. However, too high a value can lead to underfitting, as the model might not be able to capture important distinctions in the data.
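To make these parameters concrete, the sketch below (the parameter grids are illustrative only) compares a few combinations of Mtry and minimum node size using the out-of-bag error reported by the ranger package, which is introduced in detail next.
# Illustrative tuning of Mtry and minimum node size via the out-of-bag error
library(ranger)
params <- expand.grid(mtry = c(10, 20, 50), min_node = c(5, 10))
params$oob_mse <- NA
for (i in seq_len(nrow(params))) {
  fit_i <- ranger(x = x, y = y, num.trees = 500,
                  mtry = params$mtry[i], min.node.size = params$min_node[i])
  params$oob_mse[i] <- fit_i$prediction.error
}
params[which.min(params$oob_mse), ] # combination with the lowest OOB error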
In R, the implementation of RF can be accessed through several packages, with the most traditional being the randomForest package (Breiman et al. 2018). It offers a straightforward approach to both classification and regression tasks and includes functionalities for assessing variable importance and generating partial dependence plots. However, for larger datasets or more computationally intensive tasks, the randomForest package can be slow. To address these limitations, we will use the ranger package (Wright and Ziegler 2017) for our RF implementation. Ranger is optimized for speed and can handle larger datasets more efficiently. It is written in C++ and offers a faster execution time compared to the randomForest package, which is especially beneficial when dealing with high-dimensional data as in soil spectroscopy.
# Load the ranger package
library(ranger)
# Set seed for reproducibility (as our model has a random component)
set.seed(801124)
# Fit Random Forest model
rf_model <- ranger(
x = x, # Predictor variables
y = y, # Response variable
num.trees = 500, # Number of trees
importance = 'impurity' # Calculate variable importance
)
rf_model
## Ranger result
##
## Call:
## ranger(x = x, y = y, num.trees = 500, importance = "impurity")
##
## Type: Regression
## Number of trees: 500
## Sample size: 300
## Number of independent variables: 429
## Mtry: 20
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.5576376
## R squared (OOB): 0.5545581
The code output indicates that our Random Forest regression model uses 500 trees, with 20 variables considered at each split (Mtry), determined by the square root of the total number of predictors. The model applies impurity-based variable importance and a variance split rule.
Key metrics include an Out-of-Bag (OOB) prediction error, reported as a Mean Squared Error (MSE) of 0.558 (squared units), which is equivalent to a RMSEP of 0.747 % (considering that \(RMSEP = \sqrt{MSE}\)) and an OOB R-squared of 0.555. These metrics indicate the ability of the model to explain variance in the target variable, with the OOB R-squared value showing that approximately 55.5% of the variance in organic carbon is explained by the model.
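The RMSEP can be recovered directly from the fitted ranger object.
# RMSEP (in % organic carbon) from the out-of-bag mean squared error
sqrt(rf_model$prediction.error)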
# Predicted vs. Measured values for Random Forest
plot(
rf_model$predictions, y, # Use RF model predictions
xlab = 'Predicted Organic Carbon (%)',
ylab = 'Measured Organic Carbon (%)',
xlim = range(rf_model$predictions, y), # Update x-axis limit
ylim = range(rf_model$predictions, y), # Update y-axis limit
pch = 16,
col = rgb(0.5, 0.5, 0.5, 0.7)
)
abline(0, 1, col = "red") # Add a 1:1 perfect fit line
grid()
Random Forest predicted values vs. measured organic carbon values, with the red line indicating the ideal 1:1 fit.
Figure @ref(fig:rfval) shows the Random Forest predicted organic carbon values on the x-axis and the respective measured values on the y-axis, both ranging from 0 to 5 %. The data points are dispersed across the plot, generally clustering around the red diagonal line. This line represents the ideal 1:1 relationship where predicted values would perfectly match the observed values. The proximity of the points to this line provides a visual indication of the prediction accuracy, with points closer to the line representing more accurate predictions. The scatter of points around the line suggests some variability in the prediction accuracy, which is common in model evaluations.
In previous examples of model fitting for PLSR, Cubist and RF, we used the internal functions of the packages to return the prediction error of the model. In this section we will show you how to obtain this error from a validation dataset. This error will be used in the next section to calculate validation statistics.
Two common forms of validation are data splitting and cross-validation. Data splitting involves dividing the dataset into separate subsets, typically known as the calibration set and the validation set. The model is trained on the calibration set and then evaluated on the validation set to assess its performance on unseen data. This method helps in understanding how well the model generalizes to new data.
Cross-validation, on the other hand, is a more robust method where the data is divided into \(k\) subsets or folds. The model is trained on \(k-1\) folds and tested on the remaining fold, and this process is repeated such that each fold serves as the validation set once. This technique is particularly useful in soil spectroscopy for reducing variability and ensuring that the model performance is not dependent on a particular way of splitting the data.
Representation of the two forms of validation in infrared soil spectroscopy, by data splitting or cross-validating the original dataset.
Figure @ref(fig:CV) illustrates the two approaches for validation.
Data splitting: The original dataset is divided into distinct sections: a larger calibration set and a smaller validation set. Usually we take 70% for calibration and 30% for validation. The model is fitted on the calibration set and predicts the values of the target variable on the validation set. From the comparison on the validation set of observed and predicted values we can calculate validation statistics.
Cross-validation: The original dataset is divided into multiple equal segments, each labeled as a fold. We usually divide the dataset into 10 folds for 10-fold cross-validation. The image shows a cyclic process where each fold is used as a validation set in turn while the others are combined to form a calibration set.
In R data splitting can be done by manually splitting the dataset into calibration and validation sets using base R functions. Let us assume a 70-30 split.
set.seed(123) # Ensure reproducibility
# Randomly select a subset of row indices from the spectral matrix 'x'.
# 'sample' function is used to draw a random sample of indices.
# '1:nrow(x)' generates a sequence from 1 to the total number of samples.
# 'size' specifies the number of indices to sample (70% of the total samples).
random_indices <- sample(1:nrow(x), size = floor(0.7 * nrow(x)))
# Keep the calibration set for spectra and dependent variable
calibration_x <- x[random_indices, ]
calibration_y <- y[random_indices]
# Keep the validation set for spectra and dependent variable
validation_x <- x[-random_indices, ]
validation_y <- y[-random_indices]
# Fit a model on the calibration set
rf_model_cal <- ranger(x = calibration_x, y = calibration_y, num.trees = 500)
# Predict on the validation set
rf_y_hat <- predict(rf_model_cal, validation_x)$predictions
# Compare the observed and predicted values of the validation set
head(data.frame(observed = validation_y, predicted = rf_y_hat))
## observed predicted
## 1 0.6957277 2.195611
## 2 1.4638274 2.517631
## 3 3.1562158 2.460884
## 4 3.5222743 2.058750
## 5 4.4331508 2.806809
## 6 1.7892846 2.520643
The previous code shows the process of creating a predictive model using RF, specifically focusing on predicting organic carbon content from spectral data. Initially, the code sets a random seed to ensure that the results are reproducible. It then randomly splits the dataset into a calibration set (70% of the data) and a validation set (30% of the data) using random sampling. The calibration set is used to train the model, where calibration_x contains the spectral data and calibration_y contains the corresponding organic carbon values. Similarly, validation_x and validation_y are prepared for the validation set. A RF model (rf_model_cal) is then fitted to the calibration data using the ranger function, specifying 500 trees. After the model is trained, it is used to predict the organic carbon content of the validation set. Finally, the code compares the observed values of organic carbon (validation_y) with the predicted values (rf_y_hat) by displaying them side-by-side in a data frame for the first few observations. This latter information can be used to calculate validation statistics.
Let us do the same with cross-validation by dividing the dataset into 10 parts and using each part as a validation set once, while the remaining parts serve as the calibration set.
nn <- nrow(x)
n_folds <- 10
folds <- sample(rep(1:n_folds, length.out = nn))
results <- vector("list", n_folds) # To store results from each fold
for(i in 1:n_folds) {
ith_validation_indices <- which(folds == i)
ith_calibration_indices <- setdiff(1:nn, ith_validation_indices)
# Keep the calibration set for spectra and dependent variable
ith_calibration_x <- x[ith_calibration_indices, ]
ith_calibration_y <- y[ith_calibration_indices]
# Keep the validation set for spectra and dependent variable
ith_validation_x <- x[ith_validation_indices, ]
ith_validation_y <- y[ith_validation_indices]
set.seed(123) # Ensure reproducibility
# Fit a model on the calibration set
ith_rf_model_cal <- ranger(
y = ith_calibration_y,
x = ith_calibration_x,
num.trees = 500
)
# Predict on the validation set
ith_predictions <- predict(ith_rf_model_cal, ith_validation_x)$predictions
# Store observed and predicted values
results[[i]] <- data.frame(
observed = ith_validation_y, predicted = ith_predictions
)
}
# Combine all results into one data frame
all_results <- do.call(rbind, results)
# Print the combined results to check
print(head(all_results))
## observed predicted
## 1 0.7950426 1.988368
## 2 2.3388100 2.328392
## 3 1.1162509 2.511812
## 4 0.9845226 2.013226
## 5 4.4118067 3.181790
## 6 0.9764704 1.644237
The previous code performs a 10-fold cross-validation for a random forest model using the ranger package to predict organic carbon content from spectral data. Each fold is used once as the validation set, while the remaining data serves as the calibration set. The RF model is trained on the calibration set and then used to predict the validation set. Observed and predicted values from all folds are combined into a single data frame, all_results, which is printed to show the initial results. This approach provides a thorough evaluation of model performance.
After completing the cross-validation process, we can evaluate the quality of the predictions made by the model. This evaluation is typically done by performing a pairwise comparison of the observed and predicted values. While there are numerous validation statistics available, we will focus on explaining and calculating four main metrics: the mean error (ME), root mean squared error (RMSE), Pearson’s \(r\) correlation coefficient and R-squared (or modeling efficiency). These metrics together provide a comprehensive view of the model prediction quality.
The Mean Error (ME) evaluates the systematic deviation of the predictions from the observed values, indicating the bias of the predictions:
\[ ME = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i), \] where \(\hat{y}\) are the predicted values and \(y\) are the observed values. An ME close to 0 indicates predictions with low bias.
The Root Mean Squared Error (RMSE) measures the average magnitude of the prediction errors, providing an overall assessment of prediction accuracy:
\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2} \]
Lower values of RMSE indicate better predictive accuracy. The RMSE has the unit of the target variable.
The Pearson’s \(r\) correlation coefficient measures the linear correlation between observed and predicted values, indicating the strength and direction of a linear relationship. It ranges from -1 to 1 and is calculated as:
\[ r = \frac{\sum_{i=1}^n (y_{i} - \overline{y})(\hat{y}_i - \overline{\hat{y}})}{\sqrt{\sum_{i=1}^n (y_i - \overline{y})^2} \sqrt{\sum_{i=1}^{n}(\hat{y}_i - \overline{\hat{y}})^2}}. \]
The R-squared (or modeling efficiency) assesses the goodness of fit, showing how well the variance in the observed data is explained by the model. It is obtained by:
\[ R^2 = 1-\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2}. \]
The R-squared has an ideal value of 1 and can be negative if the model is a worse predictor than using the mean of the data as predictor.
# Compute the four validation statistics from the cross-validation results
mean_error <- mean(all_results$predicted - all_results$observed)
rmse <- sqrt(mean((all_results$predicted - all_results$observed)^2))
pearsons_r <- cor(all_results$observed, all_results$predicted)
ss_res <- sum((all_results$observed - all_results$predicted)^2)
ss_tot <- sum((all_results$observed - mean(all_results$observed))^2)
r_squared <- 1 - (ss_res / ss_tot)
We can create a summary of the four statistics:
validation_summary <- data.frame(
mean_error = mean_error,
rmse = rmse,
pearson_corr = pearsons_r,
r2 = r_squared
)
print(validation_summary)
## mean_error rmse pearson_corr r2
## 1 0.01134008 0.7565291 0.7522722 0.5412878
The statistics reveal a Pearson’s correlation coefficient of 0.75, indicating a strong positive linear relationship between observed and predicted values. The R-squared value of 0.54 suggests that approximately 54% of the variance in the dependent variable is explained by the model. The Mean Error is minimal at 0.0113, implying that the model does not systematically overestimate or underestimate the observed values. However, the RMSE of 0.76 indicates moderate average errors in the predictions.
Uncertainty quantification in soil spectroscopy is crucial for several reasons. It provides a measure of confidence in the predictions made by spectroscopic models, which is essential for decision-making in agricultural management, environmental monitoring, and soil health assessment. Uncertainty quantification helps in understanding the reliability of the predictions and in determining the risk associated with decisions based on these predictions. Several methods exist for quantifying the uncertainty of the predictions made by a model. Uncertainty can readily be obtained with PLSR and other statistical models. It is more complicated for machine learning models such as Cubist and random forests, although solutions exist, notably with bootstrapping (R. A. Viscarra Rossel 2007). Here we present a straightforward method to obtain uncertainty intervals using quantile regression forests.
Quantile regression forests (Meinshausen 2006) (QRF) is an extension of the RF algorithm, designed to provide estimates for conditional quantiles of the response variable, rather than focusing solely on the conditional mean. While RF predicts a single mean response for each set of input features by averaging the predictions of numerous decision trees, QRF uses the entire conditional distribution of the outcomes generated from the leaf nodes of the trees. This distribution is used to estimate different quantities and quantiles (e.g. the median, the quartiles). This makes QRF very useful in scenarios where understanding the variability of the prediction, such as the range or the likelihood of extreme values, is as important as predicting the mean. QRF provides estimates of the uncertainty of predictions, which is not typically available with machine learning algorithms. Hereafter we obtain the standard deviation of the prediction by sampling from the conditional distribution and taking the variance of all the samples.
library(matrixStats)
set.seed(801124) # Ensure reproducibility
# Fit the quantile regression forest (QRF) model
# `quantreg = TRUE` enables the quantile regression option
qrf_model <- ranger(x = calibration_x, y = calibration_y, quantreg = TRUE)
# Predict quantiles using the QRF model
# `type = "quantiles"` specifies quantile predictions
# `what = function(x) sample(x, 100, replace = TRUE)` simulates 100
# samples from the predicted quantiles
qrf_predictions <- predict(
qrf_model,
validation_x,
type = "quantiles",
what = function(x) sample(x, 100, replace = TRUE)
)
# Calculate the mean predictions from the 100 simulations
qrf_yhat <- rowMeans(qrf_predictions$predictions)
# Calculate the variance of predictions from the 100 simulations
qrf_yhat_variance <- rowVars(qrf_predictions$predictions)
# Create a dataframe for plotting, containing observed and predicted values
qrf_plot_data <- data.frame(observed = validation_y, predicted = qrf_yhat)
# Plot: Predicted vs. Measured values for quantile Random Forest
plot(
x = qrf_yhat, y = validation_y,
xlab = 'Predicted Organic Carbon (%)',
ylab = 'Measured Organic Carbon (%)',
xlim = range(qrf_yhat, y), # Set x-axis limit
ylim = range(qrf_yhat, y), # Set y-axis limit
pch = 16,
col = "white" # Initial plot with white color to add error bars first
)
# Add error bars representing the standard deviation of the predictions
arrows(
x0 = qrf_yhat - sqrt(qrf_yhat_variance),
y0 = validation_y,
x1 = qrf_yhat + sqrt(qrf_yhat_variance),
y1 = validation_y,
code = 3, angle = 90, length = 0.02, col = rgb(1, 0.65, 0, 0.7)
)
# Add the points for predicted vs. observed values
points(
x = qrf_yhat, y = validation_y,
pch = 16,
col = rgb(0, 0, 0, 0.7)
)
# Add a 1:1 perfect fit line (diagonal) in red
abline(0, 1, col = "red")
# Add a grid to the plot for better visualization
grid()
Validation plot for the predicted organic carbon values using quantile regression forests. The yellow lines indicate the standard deviation of the predictions.
The previous code demonstrates how to fit a QRF model using the ranger package, with organic carbon as the response variable and preprocessed spectra as the predictors. The model is set up for quantile regression by enabling the quantreg = TRUE argument. After the model is trained, it simulates the distribution of predicted values by drawing 100 samples with replacement. The mean and variance of these predictions are then calculated to summarize the central tendency and uncertainty.
Figure @ref(fig:qfreds) shows the validation results where each black point represents a predicted sample value versus observed organic carbon values. The red diagonal line indicates where predictions would perfectly match the observations, serving as a reference for perfect model accuracy. The yellow error bars, extending from each dot, represent the standard deviation of the predictions, providing a visual measure of the uncertainty associated with each prediction. Longer error bars indicate greater uncertainty. This combined visualization helps to assess both the accuracy of the model and the confidence in its predictions, making it easier for readers, especially beginners, to understand the performance and reliability of the model.
In soil spectroscopy, identifying the most influential wavelengths (or wavenumbers) that contribute to model predictions is essential. We previously demonstrated this approach in the section on PLSR. Understanding which parts of the spectra are crucial for predicting a target variable not only aids in interpreting the results but also enhances model diagnostics. This understanding clarifies the underlying mechanisms driving the predictions and helps refine the model for improved accuracy and reliability.
Several methods exist for determining important wavelengths. In PLSR, regression coefficients can be used for this purpose. In RF, a variable importance metric is typically reported alongside the predictions. Here, we introduce a model-independent technique based on Shapley values. This approach can be applied to any model to report wavelength (or wavenumber) specific importance metrics.
Shapley values (Shapley 1953), derived from cooperative game theory, have been adapted for machine learning model interpretation. They allocate the “payout” (i.e., the prediction output) among the “players” (i.e., the features or inputs of the model). Each Shapley value quantifies the contribution of a single feature to the prediction of the model, considering all possible combinations of features. This is achieved by systematically evaluating the change in prediction that occurs when a feature is added to a subset of other features, averaged over all possible subsets. The result is a fair distribution of importance across all features, highlighting the impact of each feature on the output of the model. This method is particularly useful for complex models where feature interactions are non-trivial, as it provides clear, actionable insights into how and why certain features influence the predictions of the model (Wadoux 2023).
Hereafter we use the fastshap package (Greenwell and Greenwell 2020), which provides a fast implementation for estimating the Shapley values from Štrumbelj and Kononenko (2014).
library(fastshap)
# Create custom predict function that returns the predicted values as a vector
predict_with_qrf <- function(model, newdata) {
results <- predict(model, newdata)$predictions
return(results)
}
# Calculate Shapley values using fastshap
# (NOTE that the following computations might take a while)
ishapley <- fastshap::explain(
qrf_model,
X = validation_x,
pred_wrapper = predict_with_qrf,
nsim = 50
)
# Aggregate Shapley values
# Create a data frame to store the aggregated Shapley values
# 'wavelength' is extracted from the column names of 'ishapley'
# 'importance' is calculated as the mean of the absolute Shapley values for
# each wavelength
s_importance <- data.frame(
# Convert column names (wavelengths) to numeric
wavelength = as.numeric(colnames(ishapley)),
# Compute mean absolute Shapley values
importance = apply(ishapley, MARGIN = 2, FUN = function(x) mean(abs(x)))
)
# Plot the aggregated Shapley values as an area plot
ggplot(s_importance) +
geom_area(
aes(x = wavelength, y = importance),
fill = "red", alpha = 0.6
) +
scale_y_continuous(expand = c(0,0)) +
labs(x = "Wavelength (nm)", y = "Absolute contribution")
Wavelength contribution to organic carbon prediction in spectral data using quantile regression forests.
Our code uses a QRF model to predict organic carbon content from spectral data and to quantify the uncertainty of these predictions. Shapley values are computed for each wavelength to determine their contribution to the predictions. The final plot displays the absolute contributions of these wavelengths.
Figure @ref(fig:shapley) illustrates the importance of different wavelengths (in nm) in predicting organic carbon. The x-axis represents the wavelengths used as input to the model, while the y-axis shows the absolute contribution of each wavelength, with values ranging from approximately 0 to 0.05. The plot is created by aggregating the Shapley values, which quantify the influence of each wavelength on the model’s predictions.
The data shows that while many wavelengths contribute minimally (with values close to 0), several distinct peaks exist, indicating that certain wavelengths have a significantly higher influence. Notably, a prominent peak is observed around 2260 nm, indicating the substantial impact of this wavelength on the predictions of the model. This relationship is confirmed by the literature, which suggests an association with known organo-mineral complexes (Stenberg et al. 2010); see also the next Section. Other smaller peaks are distributed across the spectrum, highlighting the non-uniform contribution of different spectral regions to the predictive model. These peaks reveal which parts of the spectrum are most critical for accurate predictions, offering insights into the model’s behavior and the underlying data characteristics.
Below is a brief summary of what properties can be estimated with infrared spectroscopy. For a more complete description the reader is referred to Stenberg et al. (2010), José M. Soriano-Disla et al. (2014) and Fang et al. (2018).
Soil minerals Soil minerals are diverse and play a crucial role in soil composition. The most common soil minerals include clay minerals (e.g., smectite, vermiculite, illite, kaolinite) and iron oxides (e.g., hematite, goethite). These minerals result from the degradation of primary minerals, and their quantity depends on the chemical nature of the soil environment. Clay minerals exhibit distinct absorption patterns in the near-infrared (NIR) and mid-infrared (MIR) ranges, which can be attributed to specific chemical groups; see R. A. Viscarra Rossel et al. (2009) and Chukanov and Chervonnyi (2016). For instance, the hydroxyl (O-H) stretching and metal-OH bending modes are observed in the regions of 3400-3750 cm\(^{-1}\) and 600-950 cm\(^{-1}\), respectively. Silicate absorption occurs in the 700-1200 cm\(^{-1}\) region due to Si-O stretching, and in the 150-600 cm\(^{-1}\) region due to bending, which overlaps with Al-O stretching and bending.
Kaolinite, a 1:1 layer silicate, exhibits fundamental stretching and deformation vibrations in the MIR. It shows two or more O-H stretching bands in the range of 3600-3700 cm\(^{-1}\), with a characteristic doublet at 3700 and 3620 cm\(^{-1}\). The band near 920 cm\(^{-1}\) is related to Al-OH deformation vibrations, while the bands near 1115, 1033, and 1008 cm\(^{-1}\) are characteristic of Si-O vibrations. In the NIR, the well-known doublet absorption occurs near 2200 nm and 1400 nm, corresponding to the Al-OH bend and O-H stretch combination, and the overtone of the O-H stretch at 3620 cm\(^{-1}\), respectively.
The 2:1 layer alumino-silicates, such as smectite (e.g., montmorillonite) and illites, exhibit a single and broad absorption band between 3620 and 3700 cm\(^{-1}\), corresponding to the O-H stretching of structural hydroxyl groups. Montmorillonite smectites have resolved O-H deformation bands at 915 cm\(^{-1}\) (AlAlOH) and 840 cm\(^{-1}\) (AlMgOH), as well as an absorption near 3373 cm\(^{-1}\) due to the O-H stretching of water. Illite is characterized by the combination of the broad absorption near 3622 cm\(^{-1}\) and the bands near 825 and 750 cm\(^{-1}\).
Metal oxides, such as gibbsite, hematite, and goethite, also have characteristic absorption patterns in the MIR and NIR ranges. Gibbsite (Al(OH)3) has absorption bands near 3540 and 1000 cm\(^{-1}\) due to O-H stretching, while goethite (\(\alpha\)-FeOOH) and hematite (\(\alpha\)–Fe\(_2\)O\(_3\)) exhibit specific Fe-O and FeO-H absorbance in the 3400-3000 cm\(^{-1}\) and 900-700 cm\(^{-1}\) regions.
Soil carbonates, such as calcite or dolomites, have diagnostic absorptions in the MIR near 1450, 880, and 700 cm\(^{-1}\), corresponding to the vibrational modes of the free CO3\(^{2-}\) ion. The NIR contains the combinations and overtones of these fundamental CO3 vibrations, occurring near 2500 and 2330 nm.
Quartz, one of the few soil minerals that can be semi-quantitatively determined from its MIR spectra, has several sharp peaks in the range of 1100-1000 cm\(^{-1}\) and 700-800 cm\(^{-1}\) due to the fundamental O-Si-O stretching and bending. The doublet at 800 and 781 cm\(^{-1}\) is often used for identification.
Organic matter Soil organic matter (SOM) is the sum of all organic material found in the soil, and it plays a crucial role in various soil chemical and physical functions. The absorption of organic matter in different frequencies of the infrared spectrum has been extensively studied, allowing for the understanding of qualitative and quantitative properties of SOM. A review is made in Margenot, Parikh, and Calderón (2023).
In the mid-infrared (MIR) range, organic matter absorbance related to organic functional groups has several well-defined peaks that can permit semi-quantitative assessment of SOM chemistry, provided additional data on the soil matrix is available or mineral absorbance is removed. These peaks include amide N-H (3380 cm\(^{-1}\)), aliphatic C-H (2930, 1465 cm\(^{-1}\)), aromatic C=C (1520 cm\(^{-1}\)), phenol and carboxyl C-O (1275 cm\(^{-1}\)), and polysaccharide C-O (1170 cm\(^{-1}\)), see Margenot et al. (2017) and Demyan et al. (2012).
In the near-infrared (NIR) region, the absorption results from vibration of chemical bonds corresponding to several of the SOM chemical groups, such as C-H, N-H, and C-O. The ranges 1600-1900 and 2090-2630 nm contains most of the important SOM absorptions, but bands higher than 2000 nm can be confounded with several organic and inorganic molecules that absorb in this region.
Chemometric modeling has been widely used to extract the vast amount of SOM information contained in the NIR and MIR spectra of bulk soils. Studies have shown that both NIR and MIR spectroscopy can be used to predict organic matter and its compounds, such as organic C and total N, with high correlation between measured and predicted values (Dalal and Henry 1986; Ben-Dor, Inbar, and Chen 1997; Leslie J. Janik, Merry, and Skjemstad 1998).
To improve the characterization of SOM functional groups in the MIR range, strategies such as spectral subtraction techniques or chemical pre-treatment have been employed. Spectral subtraction involves removing the OM from the soil sample, followed by the subtraction of the scan of the treated soil sample from the original scan on the bulk soil. This enables better understanding of the peaks related to organic components of the soil sample. Alternatively, organic absorbance may be enhanced by demineralization procedures with hydrofluoric acid (HF), which has been shown to improve SOM characterization with MIR spectroscopy.
Macro- and micronutrients Soil spectroscopy has been widely used to estimate soil macronutrients, such as N, P, Ca, K, Mg, and S, due to their direct relationship with plant nutrition and soil fertility (Ng et al. 2022). These macronutrients, except for mineral N, have a relatively constant level over time, making them suitable for fertilizer optimization and precision agriculture.
Nitrogen (N) is the most important nutrient for plant growth and soil fertility. N is strongly correlated with soil organic matter (SOM) and the spectrally active amide groups in SOM. While total and organic N can be estimated accurately using NIR and MIR spectroscopy, estimating N fractions and mineral N is more challenging, with variable results reported in the literature.
Phosphorus (P) estimation has also been relatively well-studied, with a focus on extractable P. The variance explained in the existing studies on soil spectroscopy is variable, likely due to the different laboratory methods used to measure available P. Better results have been obtained for estimating P sorption and total P, although the latter is less useful for understanding plant nutrition.
Potassium (K) estimation has been less successful, as exchangeable K is influenced by properties that are not spectrally active in the NIR or MIR, such as K+ concentration in the soil solution. Overall, poor results are obtained when estimating K forms, with exchangeable K better predicted by NIR and total K better predicted by MIR.
Magnesium (Mg) and calcium (Ca) are spectrally active due to their association with carbonates and clay minerals. NIR and MIR spectroscopy can provide accurate estimates of exchangeable and total Mg and Ca, with MIR being more suited for total Mg and NIR for exchangeable Mg.
In contrast to macronutrients, soil micronutrients (B, Cl, Co, Cu, Fe, Mn, Mo, Na, Ni, and Zn) have been less studied, and their estimation using NIR and MIR spectroscopy is highly variable. Their estimation is often dependent on covariation with spectrally active soil compounds, and nutrients that depend more on the soil solution than the soil matrix are less likely to be accurately estimated.
pH and lime requirement Soil pH is a crucial indicator of potential soil fertility, influencing various soil processes such as mineralization and biological activity. When the pH is too low (acidic soils), liming is applied to neutralize sources of acidity. Both pH and lime requirement can be estimated using near-infrared (NIR) and mid-infrared (MIR) spectroscopy, which correlate the spectral response to soil attributes like organic matter and clay minerals.
The relationship between soil pH and infrared spectroscopy is complex, with conflicting results reported in the literature. Some studies found correlations between pH and bands associated with cation exchange capacity and clay minerals, while others did not find a correlation with clay. Carbonates were identified as an influential factor. The bands associated with pH-correlated soil attributes vary across different soil types.
Despite this complexity, soil pH can be estimated accurately using NIR and MIR spectra. Estimation accuracy is similar for pH measured in water and in CaCl2, with median root mean square errors of 0.29 and 0.30, respectively (Leslie J. Janik, Merry, and Skjemstad 1998).
Lime requirement, defined as the quantity of calcium carbonate needed to raise the pH of the top 10 cm of soil by one pH unit, can also be estimated using NIR and MIR spectroscopy. MIR is generally more suitable because it estimates quartz and organic matter more accurately. Where variable results have been reported, they are often attributed to large errors in the laboratory determination of lime requirement rather than to the spectroscopy itself.
It is nonetheless important to estimate soil pH and lime requirement with caution, as excess lime can immobilize trace elements. The error in estimating pH is often around one-third of a pH unit, which corresponds to a lime requirement of up to 10–15 tonnes of limestone per hectare in a clayey soil (Stenberg et al. 2010).
Soil physical properties play a crucial role in regulating various processes within the soil ecosystem, including compaction, erodibility, water and nutrient dynamics, and heat and air flow (José M. Soriano-Disla et al. 2014). These properties can be characterized by attributes such as soil texture fractions, soil structural characteristics, and soil water properties.
Soil aggregation, including the stability of aggregates and the distribution of micro- and macro-aggregates, has been estimated using NIR and MIR spectroscopy (José M. Soriano-Disla et al. 2014). These techniques have shown promising results, with the variance accounted for ranging from 0.46 to 0.60 for NIR and around 0.53 for MIR, and MIR spectroscopy has generally been found more suitable than NIR for predicting soil aggregate properties.
Knowledge of soil water content is crucial for the development of effective irrigation systems and successful plant growth. Spectroscopic methods covering the MIR, NIR, and visible-near-infrared (Vis-NIR) regions have been used to predict gravimetric soil water content with moderate to high success (\(R^2\) values ranging from 0.83 to 0.89, and as high as 0.98 for the UV-Vis-NIR region) (Bouma 1989). The NIR region has slightly outperformed the other spectral ranges, with a median \(R^2\) of 0.89.
Soil water content at field capacity and at wilting point has also been predicted well using infrared spectroscopy. Field capacity water content has been correlated with infrared spectra, with acceptable predictions using Vis-NIR (\(R^2 = 0.65\)) and very high accuracy using NIR (\(R^2 = 0.86\)) and MIR (\(R^2 = 0.92\) and 0.81) (McBratney, Minasny, and Viscarra Rossel 2006). Soil water content at wilting point (1500 kPa) has been predicted well using MIR (\(R^2 = 0.89\)), but with less accuracy using Vis-NIR (\(R^2 = 0.75\)).
The prediction of particle size composition, including clay, silt, and sand, has been an ongoing area of interest in infrared spectroscopy. Particle size distribution influences key soil processes, and clay content has a significant impact on soil structure and on shrink-swell properties driven by clay type. Nearly equivalent \(R^2\) values have been obtained for clay and sand using the MIR (median \(R^2\) of 0.80 and 0.83, respectively) and Vis-NIR (median \(R^2 = 0.78\) for both) regions. Lower median \(R^2\) values were generally observed for silt, with the best results given by Vis-NIR predictions (median \(R^2 = 0.67\)) and slightly lower accuracy for MIR (median \(R^2 = 0.63\)).
Specific surface area, another soil physical property, has been predicted using NIR and Vis-NIR with \(R^2\) values of 0.70 and 0.92, respectively. Bulk density, an important agronomic property, has been moderately well predicted using Vis-NIR (\(R^2 = 0.72\)) and MIR (\(R^2 = 0.62\)).
Finally, spectroscopy has been coupled with other approaches, such as pedotransfer functions, to estimate soil physical properties based on soil pore-space relationships (e.g. volumetric water retention, hydraulic conductivity, penetration resistance, and bulk density) (McBratney, Minasny, and Viscarra Rossel 2006). These pore-space-related properties are not usually well predicted directly with Vis-NIR spectroscopy (B. Minasny et al. 2008), although there are exceptions, primarily where they correlate strongly with basic, spectrally active soil properties. Emphasis should therefore be placed on using soil spectroscopy and pedotransfer functions together in soil spectral inference systems (McBratney, Minasny, and Viscarra Rossel 2006), i.e. feeding spectroscopic estimates of basic properties into existing or newly calibrated pedotransfer functions, as sketched below.
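The sketch below illustrates this idea under strong assumptions: a PLSR model (pls package) estimates clay and organic C from simulated spectra, and the estimates are then passed to a pedotransfer function for field capacity. All object names, the simulated data, and the pedotransfer coefficients are invented for illustration; they do not come from the cited studies.

```r
library(pls)

# Simulated spectral library with known clay and organic C reference values
set.seed(1)
X    <- matrix(rnorm(100 * 200), 100, 200)       # spectra (rows = samples)
clay <- rowMeans(X[, 1:50]) * 10 + 30            # simulated clay content (%)
oc   <- rowMeans(X[, 51:100]) * 2 + 1.5          # simulated organic C (%)

# Step 1: spectral inference of the basic soil properties
m_clay <- plsr(clay ~ X, ncomp = 10, validation = "CV")
m_oc   <- plsr(oc ~ X, ncomp = 10, validation = "CV")

Xnew     <- matrix(rnorm(5 * 200), 5, 200)       # spectra of unknown samples
clay_hat <- predict(m_clay, newdata = Xnew, ncomp = 10)[, 1, 1]
oc_hat   <- predict(m_oc, newdata = Xnew, ncomp = 10)[, 1, 1]

# Step 2: hypothetical pedotransfer function for field capacity
# (illustrative coefficients only -- a locally calibrated PTF would be used)
fc_hat <- 0.10 + 0.004 * clay_hat + 0.02 * oc_hat
fc_hat
```

### Biological Properties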
The use of spectroscopic techniques, such as near-infrared (NIR) and mid-infrared (MIR) spectroscopy, has gained significant attention in the field of soil science for the assessment of soil microbial properties. These techniques offer a rapid, non-destructive, and cost-effective approach to quantifying various soil microbial parameters, including biomass, respiration, and enzyme activities.
The microbial biomass, which represents a small fraction (less than 5%) of the total organic carbon content in soils, has been successfully predicted using both NIR and MIR spectroscopy. Chang et al. (2001) reported an \(R^2\) of 0.6 for biomass C using NIR, while Terhoeven-Urselmans et al. (2006) achieved similar accuracy with an \(R^2\) of 0.56. However, higher accuracies have been reported for microbial biomass C predictions using MIR, with \(R^2\) values of 0.69 and 0.79, respectively, as reported by Leslie J. Janik, Merry, and Skjemstad (1998) and Ludwig et al. (2015). Additionally, Ludwig et al. (2015) were able to determine biomass N (\(R^2=0.79\)) and P (\(R^2=0.62\)) with moderate accuracy using MIR.
Regarding microbial activity, Chang et al. (2001) reported better success in predicting microbial respiration parameters using NIR, with \(R^2\) values of 0.72, 0.66, and 0.82 for mineralizable N, total respiration rate, and basal respiration rate (BSR), respectively. In contrast, Van Groenigen et al. (2003) found higher accuracy for mineralizable N using NIR (\(R^2=0.46\)) compared to MIR (\(R^2=0.21\)) in flooded soils. Similarly, Ludwig et al. (2015) reported an acceptable accuracy of \(R^2=0.63\) for BSR using MIR, while Zornoza et al. (2008) achieved excellent accuracy (\(R^2=0.92\)) for BSR with NIR. Meyer et al. (2018) utilized MIR for the prediction of soil respiration (\(R^2=0.83\)) and its sensitivity to temperature (Q10) in various climates.
Microbial biomarkers, such as phospholipid fatty acids (PLFAs), have also been successfully predicted using spectroscopic techniques. Zornoza et al. (2008) reported excellent accuracies for the prediction of several variables related to soil microbial groups based on PLFA biomarkers, including bacteria (\(R^2=0.93\)), Gram-positive bacteria (\(R^2=0.91\)), actinomycetes (\(R^2=0.92\)), vesicular-arbuscular mycorrhizal (VAM) fungi (\(R^2=0.91\)), and total PLFAs (\(R^2=0.91\)). Furthermore, Huleihel et al. (2018) demonstrated the potential of infrared spectroscopy for the detection and differentiation of fungi at the genus level.
The presence and activity of soil enzymes, which are closely linked to soil physico-chemical properties, have also been predicted using spectroscopic techniques. Reeves III (2010) achieved \(R^2\) values of around 0.8 with NIR for four enzymes: dehydrogenase, phosphatase, arylsulfatase, and urease. Mimmo et al. (2002) reported accuracies of \(R^2=0.46\) and 0.47 for phosphatase and arylsulfatase with MIR, compared with \(R^2=0.59\) with NIR. Cohen, Prenger, and DeBusk (2005) achieved similar accuracy for the prediction of peptidase, dehydrogenase, phosphatase, and \(\beta\)-glucosidase (\(R^2 = 0.66\) – 0.79).
Cécillon et al. (2009) explored the integration of NIR predictions of various soil attributes, including organic matter, nutrient supply, and biological activity, into biological indicators of soil quality used to assess the effects of wildfire and earthworms.
Soil contamination is a significant environmental concern, with a wide range of natural and synthetic materials present in the soil matrix. The prediction and identification of these contaminants can be achieved using spectroscopic techniques, such as MIR and NIR spectroscopy.
Organic contaminants, including petroleum hydrocarbons, microplastics, and pesticides, can be detected and quantified using NIR and MIR spectroscopy. Absorbance peaks in the NIR region \((2222 – 2439 \text{ nm})\) and the MIR region \((3000 – 2600 \text{ cm}^{-1})\) have been used to predict total petroleum hydrocarbon content with high accuracy \((R^2 = 0.93)\) (Forrester et al. 2013). The diagnostic peaks in the MIR region are two \(-CH_3\) bands \((2950 \text{ and } 2730 \text{ cm}^{-1})\), which are absent from natural soil organic matter, whose \(-CH_2\) absorptions occur at \(2930 – 2850 \text{ cm}^{-1}\). Similarly, the NIR region at \(2250 - 2350 \text{ nm}\) has been used to predict total petroleum hydrocarbon content with good accuracy \((R^2 = 0.64)\) in moist intact soil (Chakraborty et al. 2015).
Polycyclic aromatic hydrocarbons (PAHs) have also been successfully predicted using NIR spectroscopy, with good to excellent prediction accuracies \((R^2 = 0.77 – 0.89)\) (Okparanma and Mouazen 2013). The absorption at \(1647 \text{ nm}\) is attributed to C-H stretching modes of aromatic hydrocarbons related to PAHs, while the absorptions around \(1712 \text{ and } 1752 \text{ nm}\) are related to C-H stretching modes of terminal \(CH_3\) and saturated \(CH_2\).
Microplastics, such as polyethylene terephthalate (PET) and low-density polyethylene (LDPE), can be identified using NIR spectroscopy. Absorptions between \(1130\) and \(1720 \text{ nm}\) in the NIR range have been reported for the identification of these plastic polymers (Sato et al. 2003; Corradini et al. 2019; Ng, Minasny, and McBratney 2020). Absorbance at \(1210, 1420,\) and \(1730 \text{ nm}\) is due to C–H \((CH_2)\) first and second overtones and C–H deformation, while absorbance at \(1660 \text{ nm}\) is related to overtone stretching of aromatic C–H groups, which are present in PET but not in LDPE.
Pesticides, such as glyphosate, can also be predicted using NIR spectroscopy. Hermansen et al. (2020) successfully utilized NIR for predicting glyphosate sorption in soil \((R^2 = 0.93)\), with absorbances in the \(400–440, 560–600, 640–680, 1280–1320, 2160–2240,\) and \(2280–2360 \text{ nm}\) regions.
Although inorganic contaminants, such as heavy metals, are not spectrally active, their concentrations can be predicted through their relationships with clay and organic matter (Bray, Rossel, and McBratney 2009). Gholizadeh et al. (2020) reported significant correlations between heavy metals (Cu, Pb, Zn, and Cd) and clay and/or organic matter; these metals were predicted better than Mn (\(R^2\) up to 0.6) using the vis-NIR range. Similarly, Kemper and Sommer (2002) achieved high \(R^2\) values for the prediction of As (0.84), Fe (0.72), Hg (0.96), Pb (0.95), and Sb (0.93) using NIR spectroscopy. J. M. Soriano-Disla et al. (2013) predicted heavy metal concentrations determined by X-ray fluorescence spectroscopy (Pb, As, Cr, Zn, and Ni) with moderate accuracy (\(R^2\) ranging from 0.42 to 0.73) using MIR spectroscopy.
Quantitative spectral data analysis requires statistical techniques to relate soil attributes to the spectral signature. Various methods have been used, the simplest of which are linear regression models fitted between soil attributes and specific wavelengths. For example, Dalal and Henry (1986) fitted a linear model on three user-defined wavelengths to predict soil moisture, organic carbon, and total nitrogen from the peak intensities. More elaborate techniques for spectral calibration and prediction are now routinely used. For example, principal component regression (PCR) and partial least squares regression (PLSR) use the full spectrum as predictor, combined with dimensionality reduction. Both techniques are insensitive to multi-collinearity between wavelengths. In PCR (Massy 1965), the spectra are decomposed into a set of eigenvectors and scores, which are then fitted with a linear model to the soil attributes. In PLSR, the decomposition is made so as to exploit the correlation between the decomposed spectra and the soil property. The resulting eigenvectors and scores therefore tend to explain the variation in the soil property better, making the model more robust, faster to fit, and easier to interpret than other, more complex models. PLSR models can now be fitted in nearly all software implementations embedded in spectrometers.
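As a minimal illustration, the sketch below fits PCR and PLSR models with the R package pls on simulated data; the simulated spectra and organic C values are invented for illustration only.

```r
library(pls)

# Simulated calibration data: 120 spectra of 300 bands and an organic C value
set.seed(42)
n <- 120; p <- 300
spectra <- matrix(rnorm(n * p), n, p)
soc     <- as.numeric(spectra %*% rnorm(p, sd = 0.02)) + rnorm(n, sd = 0.1)

# PCR and PLSR on the full spectrum, with 10-fold cross-validation
m_pcr <- pcr(soc ~ spectra, ncomp = 20, validation = "CV", segments = 10)
m_pls <- plsr(soc ~ spectra, ncomp = 20, validation = "CV", segments = 10)

# Cross-validated RMSEP against the number of components; PLSR typically
# needs fewer components because its decomposition also uses the response
plot(RMSEP(m_pls), main = "PLSR: cross-validated RMSEP")
selectNcomp(m_pls, method = "onesigma", plot = FALSE)  # parsimonious choice
```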
In recent years, data mining and machine learning techniques have also been widely adopted to relate spectral data to soil attributes. Examples include support vector machines (Deiss et al. 2020), neural networks (L. J. Janik, Forrester, and Rawson 2009), boosted regression trees (R. A. Viscarra Rossel and Behrens 2010), and regression trees and random forests (Wadoux 2023). Machine learning techniques can handle non-linear relationships between the spectra and soil attributes and are generally fast to calibrate. They may also perform better than linear models, in particular on complex and large soil spectral libraries. However, most techniques do not provide an estimate of prediction uncertainty and are rather difficult to interpret. Uncertainty estimates can be obtained with linear models, or by combining calibration models with bootstrap aggregation (e.g. bagging-PLSR; R. A. Viscarra Rossel 2007), although this results in a substantial increase in computational requirements.
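A minimal sketch of the bootstrap-aggregation idea mentioned above: the calibration is refit on B bootstrap resamples, and the spread of the resulting predictions serves as an uncertainty estimate. It reuses the simulated `spectra`, `soc`, and `p` objects from the previous sketch and is a generic bagging illustration, not the exact bagging-PLSR algorithm of the cited study.

```r
# Bagging-PLSR sketch: B bootstrap refits give a distribution of predictions
B <- 50; nc <- 10
unknown <- matrix(rnorm(5 * p), 5, p)          # spectra of unknown samples

boot_pred <- replicate(B, {
  i <- sample(nrow(spectra), replace = TRUE)   # bootstrap resample
  m <- plsr(soc[i] ~ spectra[i, ], ncomp = nc)
  predict(m, newdata = unknown, ncomp = nc)[, 1, 1]
})

cbind(mean = rowMeans(boot_pred),              # bagged (aggregated) prediction
      sd   = apply(boot_pred, 1, sd))          # per-sample uncertainty
```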
A topic that has garnered much interest in the last decade is the local calibration of spectroscopic models. Global models are calibrated using all the available spectral data, whereas a local model is calibrated using a subset of the spectral data. In local modelling, the spectral data are split into several groups with similar characteristics, and a model is fitted for each group. A local model is usually more accurate than a global model, in particular for large and complex spectral data, but calibrating local models can be challenging. A popular method for local calibration is memory-based learning (Ramirez-Lopez et al. 2013), implemented in R in the resemble package (Ramirez-Lopez et al. 2016). In this method, the local data are selected based on spectral characteristics (e.g. the spectral scores), using similarity and distance metrics such as the Euclidean and Mahalanobis distances, the correlation dissimilarity, the dot-product cosine distance, and the spectral information divergence. Another popular method, rs-local (Lobsey et al. 2017), works by iteratively resampling a spectral library to calibrate a local model and make a prediction. While both memory-based learning and rs-local build a local model based on the characteristics of the spectra, other methods rely on building geographically local models, or models developed for specific soil types or vegetation communities. In each case (i.e. geographically or spectrally local), the complexity of the spectral data is reduced in the local sample because each subset shares similar characteristics. Approaches for local modelling were strongly advocated by Raphael A. Viscarra Rossel et al. (2024), with an overview of current localization methods and the development of transfer learning methods. An example of transfer learning is provided in Padarian, Minasny, and McBratney (2019), where global and continental models are refined locally to predict soil properties from their vis-NIR spectra.
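The resemble package provides a full implementation of memory-based learning through its `mbl()` function; the sketch below instead hand-codes the core idea so the steps stay visible: for each prediction sample, the k nearest library neighbours are selected in principal component score space, and a local PLSR is fitted on that subset only. It reuses the simulated `spectra`, `soc`, and `unknown` objects from the earlier sketches; the neighbourhood size and number of components are arbitrary choices.

```r
library(pls)

# Hand-coded memory-based (local) learning
pc     <- prcomp(spectra)                  # common principal component space
scores <- pc$x[, 1:10]                     # library scores
new_sc <- predict(pc, unknown)[, 1:10]     # scores of the prediction samples
k      <- 40                               # neighbourhood size

local_pred <- sapply(seq_len(nrow(unknown)), function(j) {
  d  <- sqrt(colSums((t(scores) - new_sc[j, ])^2))  # Euclidean distance
  nn <- order(d)[1:k]                               # k nearest neighbours
  m  <- plsr(soc[nn] ~ spectra[nn, ], ncomp = 5)    # local calibration
  predict(m, newdata = unknown[j, , drop = FALSE], ncomp = 5)[, 1, 1]
})
local_pred
```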
A chemometric model built on spectra from one instrument is unlikely to be applicable to spectra from another instrument, because of differences in scanning protocol, lighting conditions, instrument settings, sensors, or white reference type (R. A. Viscarra Rossel et al. 2016). The spectra scanned with one instrument are also influenced by external factors such as moisture, and these factors often vary from laboratory to laboratory. Similarly, two instruments from the same manufacturer produce slightly different spectra. Calibration transfer functions enable the transfer of a chemometric model (e.g. PLSR) to spectra scanned with another instrument by standardizing the subordinate spectra to reference spectra; commonly, the reference spectra are those used to calibrate the chemometric model. One such approach is to align the spectral measurements to an agreed-upon and well-characterized (species and concentration) material, as is done with internal soil standard (ISS) approaches. For example, Ben-Dor, Ong, and Lau (2015) proposed soil samples from bright and homogeneous sand dunes along the coastline of Wylie Bay (WB) and Lucky Bay (LB) in south-western Australia as global reference materials. Other methods, collectively known as multivariate standardization, include direct standardization and piecewise direct standardization (Wang, Veltkamp, and Kowalski 1991), and external parameter orthogonalization (Budiman Minasny et al. 2011). In these methods, a set of soil samples is scanned on both instruments, and a transfer matrix is estimated from the difference between the two sets of spectra, for each spectral band. Once estimated, the subordinate (i.e. unstandardized) spectra are multiplied by the transfer matrix, which yields spectra corrected for the difference between instruments (see the sketch below). These techniques were evaluated by Pittaki-Chrysodonta et al. (2021), who found them robust for transferring soil spectroscopy calibration models between instruments. They are, however, seldom used in soil infrared research because they require the same samples to be scanned on the various instruments. Recently, Mishra et al. (2020) developed techniques for transferring calibration models without standard samples, and Mishra et al. (2021) note that standard samples are only one way of transferring calibration models and that techniques working without standard samples hold promise in NIR spectroscopy.
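A minimal sketch of direct standardization on simulated data: a set of transfer samples is scanned on both the reference (master) and subordinate instruments, and a transfer matrix is estimated by least squares so that the subordinate spectra, multiplied by it, approximate the reference spectra. The data, the distortion applied, and all object names are invented for illustration; `MASS::ginv()` supplies the pseudo-inverse.

```r
library(MASS)   # ginv(): Moore-Penrose pseudo-inverse

# Transfer samples scanned on both instruments (simulated): the subordinate
# instrument applies a systematic gain, offset, and noise to the reference
set.seed(7)
n_t <- 30; p <- 300
master <- matrix(rnorm(n_t * p), n_t, p)             # reference instrument
slave  <- 1.05 * master + 0.02 +
          matrix(rnorm(n_t * p, sd = 0.01), n_t, p)  # subordinate instrument

# Direct standardization: estimate F so that slave %*% F approximates master
F_mat <- ginv(slave) %*% master                      # p x p transfer matrix

# Standardize new spectra coming from the subordinate instrument
new_slave <- matrix(rnorm(5 * p), 5, p)
new_std   <- new_slave %*% F_mat                     # corrected spectra
```

With few transfer samples this least-squares problem is under-determined, which is one reason piecewise direct standardization, which estimates the transfer within a small window around each band, is often preferred in practice.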
Brendan Malone (Commonwealth Scientific and Industrial Research Organization Agriculture and Food Research Unit, Australia)
Yuxin Ma (New South Wales Department of Climate Change, Energy, the Environment and Water, Australia)
This course is a training resource on soil spectral data analysis, covering beginner to intermediate levels. It was developed by the Global Soil Laboratory Network Initiative on Soil Spectroscopy (GLOSOLAN-Spec) under FAO’s Global Soil Partnership (GSP), through the collaboration of soil spectroscopy experts from various institutions worldwide.
The GLOSOLAN-Spec Initiative and the authors extend their gratitude to FAO’s Soil Mapping for Resilient Agrifood Systems (SoilFER) project and the Vision for Adapted Crops and Soils (VACS) – initiated by the United States Department of State – for their support.
Special thanks also go to the Kellogg Soil Survey Laboratory of USDA-NRCS for providing the spectral datasets used in this document.